The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit edb6893b99b2d701d7e76e8eadab8729e8edd190
Author: Valeriy Vdovin <valeriy.vdo...@virtuozzo.com>
Date:   Mon Oct 4 20:39:01 2021 +0300

ve/fs/binfmt: virtualization

* keep reference from binfmt_misc sb to ve
* store pointer to binfmt_misc data in ve->binfmt_misc

Here bm_put_super() can race with load_misc_binary() caller, which is
working with get_exec_env()->binfmt_misc. Will be fixed separately.

Signed-off-by: Konstantin Khlebnikov <khlebni...@openvz.org>

+++
VE/BINFTM: fix destruction ordering

kill binfmt_data together with ve_struct

Signed-off-by: Konstantin Khlebnikov <khlebni...@openvz.org>

+++
ve/binfmt_misc: do not use sb->s_fs_info

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls"
  Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
  Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

======================
This patch description:

When we virtualized binfmt_misc, we made sb->s_fs_info store a pointer
to binfmt_misc struct. At the same time, we store a pointer to the owner
ve_struct in sb->s_ns and a pointer to the same binfmt_misc struct in
ve_struct->binfmt_misc. That said, we don't actually need to use
s_fs_info, because we can get the binfmt_misc by dereferencing
sb->s_ns->binfmt_misc.

Using sb->s_fs_info instead of sb->s_ns will allow us to revert our
patches introducing sb->s_ns.

This could be merged to 0b0dbb644794 ("VE/BINFTM: virtualization").

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>

+++
ve/binfmt_misc: do not use s_ns

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls"
  Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
  Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

======================
This patch description:

Since 9e7411c5c3b5 was reverted, we must use sb->s_fs_info for storing a
pointer to the namespace.

This could be merged to 0b0dbb644794 ("VE/BINFTM: virtualization").

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>

+++
ve/fs: Use ve_printk in fs/binfmt_aout.c

This is a part of 74-diff-ve-mix-combined.

https://jira.sw.ru/browse/PSBM-17903

Signed-off-by: Kirill Tkhai <ktk...@parallels.com>

+++
ve/fs: Allow to mount binfmt_misc under non-root ns

https://jira.sw.ru/browse/PSBM-40100

v2: Check that user_ns is initial for the ve.
v3: Be sure ve->init_cred is set.

Signed-off-by: Kirill Tkhai <ktk...@odin.com>
Acked-by: Vladimir Davydov <vdavy...@virtuozzo.com>

khorenko@: in fact we allowed those mounts in the top CT user ns only.

+++
ve/binfmt_misc: Allow mount if capable(CAP_SYS_ADMIN)

The patch allows mounting binfmt_misc in a CT with ve0's admin caps,
which is needed for CRIU dump. At that time, an unmounted binfmt_misc
may be forcibly mounted back, and we don't want to change CRIU's
user_ns to do that.

https://jira.sw.ru/browse/PSBM-47737

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabi...@virtuozzo.com>

+++
ve/fs/binfmt_misc: store link to ve in sb->s_fs_info

https://jira.sw.ru/browse/PSBM-85685
Signed-off-by: Andrey Ryabinin <aryabi...@virtuozzo.com>

+++
Overrides: ve/fs/binfmt: store link to ve in sb->s_fs_info

After rebase to RHEL7.5, sb->s_fs_info by default contains a link to
the "ns" provided to mount_ns(), but in the binfmt_misc code we need a
link to ve there, so adjust bm_fill_super() accordingly.

https://jira.sw.ru/browse/PSBM-85052

Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

(cherry picked from commit bd9b1e8d6f856300df13e955340ed5a2e89d1b56
due to bug https://jira.sw.ru/browse/PSBM-103973)
Signed-off-by: Valeriy Vdovin <valeriy.vdo...@virtuozzo.com>

khorenko@: rebase to RHEL8.4 notes:
 - s/FS_USERNS_MOUNT/FS_VE_MOUNT/ and dropped extra checks in bm_mount()

+++
ve/fs/binfmt: fix EBUSY on mounting second binfmt_misc in CT

After the rebase to RHEL 8.4 the binfmt_misc fs uses the new fscontext
API, and our implementation is incorrect. We have to use the
get_tree_keyed() helper instead of get_tree_single(), which allows only
one superblock per hardware node (HN).

https://jira.sw.ru/browse/PSBM-132709

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>

+++
ve/fs/binfmt: clean bm_data reference from ve on err path

1. Make sure ve->binfmt_misc is NULL if an error happens on binfmt_misc
   mount, otherwise on the next attempt to mount binfmt_misc (probably
   successful) we won't even try to allocate/init structures for it.

2. The current bm_fill_super() code makes us suppose we can get into
   the function with ve->binfmt_misc already initialized. If this is
   true and simple_fill_super() fails, we will free the preconfigured
   ve->binfmt_misc without proper deinitialization (ve_binfmt_fini()).
   Hopefully this is a wrong assumption, so rewrite the code not to
   confuse readers.
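
The two points above come down to an ordering rule: the pointer is published in the ve only after every step that can fail has succeeded. A minimal userspace sketch of that ordering follows. It is not the kernel code: `struct ve_model`, `struct bm_data_model`, `fill_super_stub()` and `bm_setup_model()` are hypothetical names invented for illustration, and the stub merely stands in for simple_fill_super().

```c
#include <errno.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-ins for struct binfmt_misc / ve_struct. */
struct bm_data_model {
	int enabled;
};

struct ve_model {
	struct bm_data_model *binfmt_misc;
};

/* Hypothetical stub in place of simple_fill_super(): fails on request. */
static int fill_super_stub(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/* Models the allocate / fill / publish ordering described above. */
static int bm_setup_model(struct ve_model *ve, int should_fail)
{
	struct bm_data_model *bm;
	int err;

	/* A second initialization for the same ve is treated as a bug. */
	if (ve->binfmt_misc)
		return -EEXIST;

	bm = calloc(1, sizeof(*bm));
	if (!bm)
		return -ENOMEM;

	err = fill_super_stub(should_fail);
	if (err) {
		/* Nothing was published yet, so a plain free is enough. */
		free(bm);
		return err;
	}

	/* Publish only after every step that can fail is behind us. */
	ve->binfmt_misc = bm;
	bm->enabled = 1;
	return 0;
}
```

With this ordering a failed attempt leaves the pointer NULL, so a later attempt allocates and initializes the structure again instead of silently skipping it.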

https://jira.sw.ru/browse/PSBM-131994

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com>
Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

(cherry picked from vz8 commit 8fc0d42d5d33e82f5238fb61561439407f208371)
Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>
---
 fs/binfmt_aout.c   |   6 +--
 fs/binfmt_misc.c   | 106 +++++++++++++++++++++++++++++++++++++++++++++--------
 include/linux/ve.h |   4 ++
 kernel/ve/ve.c     |   3 ++
 4 files changed, 100 insertions(+), 19 deletions(-)

diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 145917f734fe..70488484dcdd 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -200,12 +200,12 @@ static int load_aout_binary(struct linux_binprm * bprm)
 	if ((ex.a_text & 0xfff || ex.a_data & 0xfff) &&
 	    (N_MAGIC(ex) != NMAGIC) && printk_ratelimit())
 	{
-		printk(KERN_NOTICE "executable not page aligned\n");
+		ve_printk(VE_LOG, KERN_NOTICE "executable not page aligned\n");
 	}
 
 	if ((fd_offset & ~PAGE_MASK) != 0 && printk_ratelimit())
 	{
-		printk(KERN_WARNING
+		ve_printk(VE_LOG, KERN_WARNING
 		       "fd_offset is not page aligned. Please convert program: %pD\n",
 		       bprm->file);
 	}
@@ -293,7 +293,7 @@ static int load_aout_library(struct file *file)
 	if ((N_TXTOFF(ex) & ~PAGE_MASK) != 0) {
 		if (printk_ratelimit())
 		{
-			printk(KERN_WARNING
+			ve_printk(VE_LOG, KERN_WARNING
 			       "N_TXTOFF is not page aligned. Please convert library: %pD\n",
 			       file);
 		}
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index e7f7117db9dd..da0dfe22095c 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -26,6 +26,8 @@
 #include <linux/fs_context.h>
 #include <linux/syscalls.h>
 #include <linux/fs.h>
+#include <linux/ve.h>
+
 #include <linux/uaccess.h>
 
 #include "internal.h"
@@ -70,11 +72,7 @@ struct binfmt_misc {
 	int entry_count;
 };
 
-struct binfmt_misc binfmt_data = {
-	.entries = LIST_HEAD_INIT(binfmt_data.entries),
-	.enabled = 1,
-	.entries_lock = __RW_LOCK_UNLOCKED(binfmt_data.entries_lock),
-};
+#define BINFMT_MISC(sb) (((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)
 
 /*
  * Max length of the register string. Determined by:
@@ -143,7 +141,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	Node *fmt;
 	struct file *interp_file = NULL;
 	int retval;
-	struct binfmt_misc *bm_data = &binfmt_data;
+	struct binfmt_misc *bm_data = get_exec_env()->binfmt_misc;
 
 	retval = -ENOEXEC;
 	if (!bm_data || !bm_data->enabled)
@@ -617,7 +615,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 	Node *e = file_inode(file)->i_private;
 	int res = parse_command(buffer, count);
 	struct super_block *sb = file->f_path.dentry->d_sb;
-	struct binfmt_misc *bm_data = sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 
 	switch (res) {
 	case 1:
@@ -659,8 +657,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	Node *e;
 	struct inode *inode;
 	struct super_block *sb = file_inode(file)->i_sb;
-	struct binfmt_misc *bm_data = sb->s_fs_info;
 	struct dentry *root = sb->s_root, *dentry;
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 	int err = 0;
 	struct file *f = NULL;
 
@@ -737,7 +735,7 @@ static const struct file_operations bm_register_operations = {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
-	struct binfmt_misc *bm_data = file->f_path.dentry->d_sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(file->f_path.dentry->d_sb);
 	char *s = bm_data->enabled ? "enabled\n" : "disabled\n";
 
 	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
@@ -746,7 +744,7 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_status_write(struct file *file, const char __user *buffer,
 		size_t count, loff_t *ppos)
 {
-	struct binfmt_misc *bm_data = file->f_path.dentry->d_sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(file->f_path.dentry->d_sb);
 	int res = parse_command(buffer, count);
 	struct dentry *root;
 
@@ -785,9 +783,17 @@ static const struct file_operations bm_status_operations = {
 
 /* Superblock handling */
 
+static void bm_put_super(struct super_block *sb)
+{
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
+
+	bm_data->enabled = 0;
+}
+
 static const struct super_operations s_ops = {
 	.statfs		= simple_statfs,
 	.evict_inode	= bm_evict_inode,
+	.put_super	= bm_put_super,
 };
 
 static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
@@ -799,20 +805,79 @@ static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
 		/* last one */ {""}
 	};
 
+	struct ve_struct *ve = get_exec_env();
+	struct binfmt_misc *bm_data;
+
+	/*
+	 * bm_get_tree()
+	 *  get_tree_keyed(fc, bm_fill_super, get_ve(ve))
+	 *   fc->s_fs_info = current VE
+	 *   vfs_get_super(fc, vfs_get_keyed_super, bm_fill_super)
+	 *    sb = sget_fc(fc, test, set_anon_super_fc)
+	 *    if (!sb->s_root) {
+	 *            err = bm_fill_super(sb, fc);
+	 *
+	 * => we should never get here with initialized ve->binfmt_misc.
+	 */
+	if (WARN_ON_ONCE(ve->binfmt_misc))
+		return -EEXIST;
+
+	bm_data = kzalloc(sizeof(struct binfmt_misc), GFP_KERNEL);
+	if (!bm_data)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&bm_data->entries);
+	rwlock_init(&bm_data->entries_lock);
+
 	err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
-	if (!err) {
-		sb->s_op = &s_ops;
-		sb->s_fs_info = &binfmt_data;
+	if (err) {
+		kfree(bm_data);
+		return err;
 	}
-	return err;
+
+	sb->s_op = &s_ops;
+
+	ve->binfmt_misc = bm_data;
+	bm_data->enabled = 1;
+
+	return 0;
 }
 
 static int bm_get_tree(struct fs_context *fc)
 {
-	return get_tree_single(fc, bm_fill_super);
+	struct ve_struct *ve = get_exec_env();
+
+	/*
+	 * We need one binfmt_misc superblock per VE,
+	 * use get_tree_keyed() helper to get vfs_tree.
+	 *
+	 * It allows us to find sb by key (in our case ve is the key),
+	 * and if it doesn't exists creates new.
+	 *
+	 * Important: we take ve refcnt here. It will be put
+	 * in one of two places:
+	 * 1. bm_free_fc()
+	 *    on error path (wrong mnt opt provided for instance)
+	 *    if sb exists and initialized already
+	 * 2. bm_kill_sb() when sb refcnt becomes zero (last mount umounted)
+	 */
+	return get_tree_keyed(fc, bm_fill_super, get_ve(ve));
+}
+
+static void bm_free_fc(struct fs_context *fc)
+{
+	/*
+	 * fc->s_fs_info will be NULL if bm_fill_super() was called and
+	 * no error occured (it means that new sb was allocated successfuly)
+	 * see fs/super.c sget_fc() helper
+	 */
+	if (fc->s_fs_info)
+		put_ve(fc->s_fs_info);
 }
 
+
 static const struct fs_context_operations bm_context_ops = {
+	.free		= bm_free_fc,
 	.get_tree	= bm_get_tree,
 };
 
@@ -822,6 +887,14 @@ static int bm_init_fs_context(struct fs_context *fc)
 	return 0;
 }
 
+static void bm_kill_sb(struct super_block *sb)
+{
+	struct ve_struct *ve = sb->s_fs_info;
+
+	kill_litter_super(sb);
+	put_ve(ve);
+}
+
 static struct linux_binfmt misc_format = {
 	.module = THIS_MODULE,
 	.load_binary = load_misc_binary,
@@ -831,7 +904,8 @@ static struct file_system_type bm_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "binfmt_misc",
 	.init_fs_context = bm_init_fs_context,
-	.kill_sb	= kill_litter_super,
+	.kill_sb	= bm_kill_sb,
+	.fs_flags	= FS_VIRTUALIZED | FS_VE_MOUNT,
 };
 MODULE_ALIAS_FS("binfmt_misc");
 
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 3665e8c6a853..bb4246147c0d 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -56,6 +56,10 @@ struct ve_struct {
 
 	struct kstat_lat_pcpu_struct	sched_lat_ve;
 
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+	struct binfmt_misc	*binfmt_misc;
+#endif
+
 	struct kmapset_key	sysfs_perms_key;
 
 	atomic_t		netns_avail_nr;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 369df91efda4..a1888d6717f6 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -725,6 +725,9 @@ static void ve_destroy(struct cgroup_subsys_state *css)
 	kmapset_unlink(&ve->sysfs_perms_key, &sysfs_ve_perms_set);
 	ve_log_destroy(ve);
 	ve_free_vdso(ve);
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+	kfree(ve->binfmt_misc);
+#endif
 	free_percpu(ve->sched_lat_ve.cur);
 	kmem_cache_free(ve_cachep, ve);
 }
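
The reference counting described in the bm_get_tree() comment of the patch can be modelled in userspace to check that every get is paired with exactly one put. The sketch below is a toy model and does not use the VFS: `struct ve_ref`, `struct sb_model`, `sb_table`, `mount_keyed_model()` and `umount_model()` are hypothetical names, and the fixed-size table only imitates the "find superblock by key, else create it" behaviour attributed to get_tree_keyed() in that comment.

```c
#include <stddef.h>

#define MAX_SB 4	/* arbitrary table size for the model */

/* Hypothetical stand-in for the ve reference counter. */
struct ve_ref {
	int refcnt;
};

/* Hypothetical stand-in for a superblock keyed by its owning ve. */
struct sb_model {
	struct ve_ref *key;	/* owning ve, NULL when the slot is free */
	int mounts;		/* number of active mounts */
};

static struct sb_model sb_table[MAX_SB];

static struct ve_ref *get_ve_model(struct ve_ref *ve)
{
	ve->refcnt++;
	return ve;
}

static void put_ve_model(struct ve_ref *ve)
{
	ve->refcnt--;
}

/* Take a reference up front, then find the keyed slot or create one. */
static struct sb_model *mount_keyed_model(struct ve_ref *ve)
{
	struct ve_ref *key = get_ve_model(ve);
	struct sb_model *free_slot = NULL;
	size_t i;

	for (i = 0; i < MAX_SB; i++) {
		if (sb_table[i].key == key) {
			/* Slot exists and holds its own ref: drop ours
			 * (the role the patch gives to bm_free_fc()). */
			sb_table[i].mounts++;
			put_ve_model(key);
			return &sb_table[i];
		}
		if (!sb_table[i].key && !free_slot)
			free_slot = &sb_table[i];
	}

	if (!free_slot) {
		/* Error path: the reference must not leak. */
		put_ve_model(key);
		return NULL;
	}

	/* New slot: ownership of the reference moves to the slot. */
	free_slot->key = key;
	free_slot->mounts = 1;
	return free_slot;
}

/* The last unmount releases the slot's reference (bm_kill_sb() role). */
static void umount_model(struct sb_model *sb)
{
	if (--sb->mounts == 0) {
		struct ve_ref *ve = sb->key;

		sb->key = NULL;
		put_ve_model(ve);
	}
}
```

In this model, mounting twice for the same ve returns the same slot and leaves exactly one extra reference on the ve, which goes away only when the last mount is released.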