The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit edb6893b99b2d701d7e76e8eadab8729e8edd190
Author: Valeriy Vdovin <valeriy.vdo...@virtuozzo.com>
Date:   Mon Oct 4 20:39:01 2021 +0300

    ve/fs/binfmt: virtualization
    
    * keep reference from binfmt_misc sb to ve
    * store pointer to binfmt_misc data in ve->binfmt_misc
    
    Here bm_put_super() can race with load_misc_binary() caller, which is working
    with get_exec_env()->binfmt_misc.
    Will be fixed separately.
    
    Signed-off-by: Konstantin Khlebnikov <khlebni...@openvz.org>
    
    +++
    VE/BINFTM: fix destruction ordering
    
    kill binfmt_data together with ve_struct
    
    Signed-off-by: Konstantin Khlebnikov <khlebni...@openvz.org>
    
    +++
    ve/binfmt_misc: do not use sb->s_fs_info
    
    Patchset description:
    
    zap sb->s_ns + fix memleak in binfmt_misc
    
    Vladimir Davydov (6):
      binfmt_misc: do not use sb->s_fs_info
      Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns()
        calls"
      Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
      Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
        fill_super"
      binfmt_misc: do not use s_ns
      binfmt_misc: destroy all nodes on ve stop
    
    https://jira.sw.ru/browse/PSBM-39154
    
    Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>
    
    ======================
    This patch description:
    
    When we virtualized binfmt_misc, we made sb->s_fs_info store a pointer
    to binfmt_misc struct. At the same time, we store a pointer to the owner
    ve_struct in sb->s_ns and a pointer to the same binfmt_misc struct in
    ve_struct->binfmt_misc. That said, we don't actually need to use
    s_fs_info, because we can get the binfmt_misc by dereferencing
    sb->s_ns->binfmt_misc.
    
    Using sb->s_fs_info instead of sb->s_ns will allow us to revert our
    patches introducing sb->s_ns.
    
    This could be merged to 0b0dbb644794 ("VE/BINFTM: virtualization").
    
    Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
    
    +++
    ve/binfmt_misc: do not use s_ns
    
    Patchset description:
    
    zap sb->s_ns + fix memleak in binfmt_misc
    
    Vladimir Davydov (6):
      binfmt_misc: do not use sb->s_fs_info
      Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns()
        calls"
      Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
      Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
        fill_super"
      binfmt_misc: do not use s_ns
      binfmt_misc: destroy all nodes on ve stop
    
    https://jira.sw.ru/browse/PSBM-39154
    
    Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>
    
    ======================
    This patch description:
    
    Since 9e7411c5c3b5 was reverted, we must use sb->s_fs_info for storing a
    pointer to the namespace.
    
    This could be merged to 0b0dbb644794 ("VE/BINFTM: virtualization").
    
    Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
    
    +++
    ve/fs: Use ve_printk in fs/binfmt_aout.c
    
    This is a part of 74-diff-ve-mix-combined.
    
    https://jira.sw.ru/browse/PSBM-17903
    
    Signed-off-by: Kirill Tkhai <ktk...@parallels.com>
    
    +++
    ve/fs: Allow to mount binfmt_misc under non-root ns
    
    https://jira.sw.ru/browse/PSBM-40100
    
    v2: Check that user_ns is initial for the ve.
    v3: Be sure ve->init_cred is set.
    
    Signed-off-by: Kirill Tkhai <ktk...@odin.com>
    
    Acked-by: Vladimir Davydov <vdavy...@virtuozzo.com>
    
    khorenko@: in fact we allowed to do those mounts in top CT user ns only.
    
    +++
    ve/binfmt_misc: Allow mount if capable(CAP_SYS_ADMIN)
    
    The patch allows mounting binfmt_misc in a CT with ve0's admin caps,
    which is needed for CRIU dump. In that case, an unmounted binfmt_misc
    may be forcibly mounted back, and we don't want to change CRIU's user_ns
    to do that.
    
    https://jira.sw.ru/browse/PSBM-47737
    
    Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
    
    Reviewed-by: Andrey Ryabinin <aryabi...@virtuozzo.com>
    
    +++
    ve/fs/binfmt_misc: store link to ve in sb->s_fs_info
    
    https://jira.sw.ru/browse/PSBM-85685
    
    Signed-off-by: Andrey Ryabinin <aryabi...@virtuozzo.com>
    
    +++
    Overrides:
    
    ve/fs/binfmt: store link to ve in sb->s_fs_info
    
    After rebase to RHEL7.5 sb->s_fs_info by default contains a link to "ns"
    provided to mount_ns(), but in binfmt_misc code we need a link to ve
    there, so adjust bm_fill_super() accordingly.
    
    https://jira.sw.ru/browse/PSBM-85052
    
    Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
    
    (cherry picked from commit bd9b1e8d6f856300df13e955340ed5a2e89d1b56 due
    to bug https://jira.sw.ru/browse/PSBM-103973)
    
    Signed-off-by: Valeriy Vdovin <valeriy.vdo...@virtuozzo.com>
    
    khorenko@: rebase to RHEL8.4 notes:
    - s/FS_USERNS_MOUNT/FS_VE_MOUNT/ and dropped extra checks in bm_mount()
    
    +++
    ve/fs/binfmt: fix EBUSY on mounting second binfmt_misc in CT
    
    After the rebase to RHEL 8.4 the binfmt_misc fs uses the new fscontext API,
    and our implementation of it is incorrect. We have to use the get_tree_keyed()
    helper instead of get_tree_single(), which allows only one
    superblock per HN.
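
    The difference between the two lookup policies can be illustrated with a
    small userspace model. This is only a sketch under stated assumptions:
    model_sb, model_get_single() and model_get_keyed() are hypothetical names,
    not kernel APIs, and the EBUSY behaviour itself is not modelled. It only
    shows that a single-instance lookup hands every caller the same object,
    while a keyed lookup returns one object per key (here the key stands in
    for the VE pointer).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: none of these types or helpers exist in the kernel. */
#define MAX_SB 8

struct model_sb {
	const void *key;	/* stands in for the owning VE pointer */
	int in_use;
};

static struct model_sb single_sb;		/* one instance for everyone */
static struct model_sb keyed_sbs[MAX_SB];	/* one instance per key */

/* Single-instance policy: the key is recorded once and then ignored. */
static struct model_sb *model_get_single(const void *key)
{
	if (!single_sb.in_use) {
		single_sb.in_use = 1;
		single_sb.key = key;
	}
	return &single_sb;
}

/* Keyed policy: reuse the entry matching the key, else claim a free slot. */
static struct model_sb *model_get_keyed(const void *key)
{
	struct model_sb *free_slot = NULL;
	size_t i;

	for (i = 0; i < MAX_SB; i++) {
		if (keyed_sbs[i].in_use && keyed_sbs[i].key == key)
			return &keyed_sbs[i];
		if (!keyed_sbs[i].in_use && !free_slot)
			free_slot = &keyed_sbs[i];
	}
	if (free_slot) {
		free_slot->in_use = 1;
		free_slot->key = key;
	}
	return free_slot;	/* NULL when the table is full */
}
```

    With two distinct keys, model_get_keyed() yields two distinct entries,
    whereas model_get_single() yields the same entry for both.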
    
    https://jira.sw.ru/browse/PSBM-132709
    
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com>
    
    Reviewed-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
    
    +++
    ve/fs/binfmt: clean bm_data reference from ve on err path
    
    1. Make sure ve->binfmt_misc is NULL if error happens on binfmt_misc
    mount, otherwise on next attempt to mount binfmt_misc (probably
    successful) we won't even try to allocate/init structures for it.
    
    2. The current bm_fill_super() code makes us suppose we can get into
    the function with ve->binfmt_misc already initialized. If this is true
    and simple_fill_super() fails we will free preconfigured ve->binfmt_misc
    without proper deinitialization (ve_binfmt_fini()).
    
    Hopefully this is a wrong assumption, so rewrite the code not to confuse
    readers.
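
    The intended error-path behaviour can be sketched in userspace C. This is
    a minimal model under stated assumptions, not the kernel code: model_ve,
    model_bm, model_populate() and model_fill_super() are hypothetical names,
    and model_populate() merely stands in for simple_fill_super(). The point
    it illustrates is that the pointer is published in the VE only after
    every step that can fail has succeeded.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace model only: hypothetical stand-ins for the kernel structures. */
struct model_bm { int enabled; };
struct model_ve { struct model_bm *binfmt_misc; };

/* Hypothetical stand-in for simple_fill_super(); fails on request. */
static int model_populate(int should_fail)
{
	return should_fail ? -ENOMEM : 0;
}

static int model_fill_super(struct model_ve *ve, int should_fail)
{
	struct model_bm *bm;
	int err;

	/* Already initialized: refuse rather than overwrite live data. */
	if (ve->binfmt_misc)
		return -EEXIST;

	bm = calloc(1, sizeof(*bm));
	if (!bm)
		return -ENOMEM;

	err = model_populate(should_fail);
	if (err) {
		/* Nothing was published yet, so a plain free is enough. */
		free(bm);
		return err;
	}

	/* Publish only after everything that can fail has succeeded. */
	ve->binfmt_misc = bm;
	bm->enabled = 1;
	return 0;
}
```

    After a failed attempt the pointer in the model stays NULL, so a later
    attempt allocates and initializes the data again.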
    
    https://jira.sw.ru/browse/PSBM-131994
    
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com>
    
    Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
    
    (cherry picked from vz8 commit 8fc0d42d5d33e82f5238fb61561439407f208371)
    Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>
---
 fs/binfmt_aout.c   |   6 +--
 fs/binfmt_misc.c   | 106 +++++++++++++++++++++++++++++++++++++++++++++--------
 include/linux/ve.h |   4 ++
 kernel/ve/ve.c     |   3 ++
 4 files changed, 100 insertions(+), 19 deletions(-)

diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 145917f734fe..70488484dcdd 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -200,12 +200,12 @@ static int load_aout_binary(struct linux_binprm * bprm)
                if ((ex.a_text & 0xfff || ex.a_data & 0xfff) &&
                    (N_MAGIC(ex) != NMAGIC) && printk_ratelimit())
                {
-                       printk(KERN_NOTICE "executable not page aligned\n");
+                       ve_printk(VE_LOG, KERN_NOTICE "executable not page aligned\n");
                }
 
                if ((fd_offset & ~PAGE_MASK) != 0 && printk_ratelimit())
                {
-                       printk(KERN_WARNING 
+                       ve_printk(VE_LOG, KERN_WARNING
                               "fd_offset is not page aligned. Please convert program: %pD\n",
                               bprm->file);
                }
@@ -293,7 +293,7 @@ static int load_aout_library(struct file *file)
        if ((N_TXTOFF(ex) & ~PAGE_MASK) != 0) {
                if (printk_ratelimit())
                {
-                       printk(KERN_WARNING 
+                       ve_printk(VE_LOG, KERN_WARNING
                               "N_TXTOFF is not page aligned. Please convert library: %pD\n",
                               file);
                }
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index e7f7117db9dd..da0dfe22095c 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -26,6 +26,8 @@
 #include <linux/fs_context.h>
 #include <linux/syscalls.h>
 #include <linux/fs.h>
+#include <linux/ve.h>
+
 #include <linux/uaccess.h>
 
 #include "internal.h"
@@ -70,11 +72,7 @@ struct binfmt_misc {
        int entry_count;
 };
 
-struct binfmt_misc binfmt_data = {
-       .entries        = LIST_HEAD_INIT(binfmt_data.entries),
-       .enabled        = 1,
-       .entries_lock   = __RW_LOCK_UNLOCKED(binfmt_data.entries_lock),
-};
+#define BINFMT_MISC(sb)                (((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)
 
 /*
  * Max length of the register string.  Determined by:
@@ -143,7 +141,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
        Node *fmt;
        struct file *interp_file = NULL;
        int retval;
-       struct binfmt_misc *bm_data = &binfmt_data;
+       struct binfmt_misc *bm_data = get_exec_env()->binfmt_misc;
 
        retval = -ENOEXEC;
        if (!bm_data || !bm_data->enabled)
@@ -617,7 +615,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
        Node *e = file_inode(file)->i_private;
        int res = parse_command(buffer, count);
        struct super_block *sb = file->f_path.dentry->d_sb;
-       struct binfmt_misc *bm_data = sb->s_fs_info;
+       struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 
        switch (res) {
        case 1:
@@ -659,8 +657,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
        Node *e;
        struct inode *inode;
        struct super_block *sb = file_inode(file)->i_sb;
-       struct binfmt_misc *bm_data = sb->s_fs_info;
        struct dentry *root = sb->s_root, *dentry;
+       struct binfmt_misc *bm_data = BINFMT_MISC(sb);
        int err = 0;
        struct file *f = NULL;
 
@@ -737,7 +735,7 @@ static const struct file_operations bm_register_operations = {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
-       struct binfmt_misc *bm_data = file->f_path.dentry->d_sb->s_fs_info;
+       struct binfmt_misc *bm_data = BINFMT_MISC(file->f_path.dentry->d_sb);
        char *s = bm_data->enabled ? "enabled\n" : "disabled\n";
 
        return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
@@ -746,7 +744,7 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_status_write(struct file *file, const char __user *buffer,
                size_t count, loff_t *ppos)
 {
-       struct binfmt_misc *bm_data = file->f_path.dentry->d_sb->s_fs_info;
+       struct binfmt_misc *bm_data = BINFMT_MISC(file->f_path.dentry->d_sb);
        int res = parse_command(buffer, count);
        struct dentry *root;
 
@@ -785,9 +783,17 @@ static const struct file_operations bm_status_operations = {
 
 /* Superblock handling */
 
+static void bm_put_super(struct super_block *sb)
+{
+       struct binfmt_misc *bm_data = BINFMT_MISC(sb);
+
+       bm_data->enabled = 0;
+}
+
 static const struct super_operations s_ops = {
        .statfs         = simple_statfs,
        .evict_inode    = bm_evict_inode,
+       .put_super      = bm_put_super,
 };
 
 static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
@@ -799,20 +805,79 @@ static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
                /* last one */ {""}
        };
 
+       struct ve_struct *ve = get_exec_env();
+       struct binfmt_misc *bm_data;
+
+       /*
+        * bm_get_tree()
+        *  get_tree_keyed(fc, bm_fill_super, get_ve(ve))
+        *   fc->s_fs_info = current VE
+        *   vfs_get_super(fc, vfs_get_keyed_super, bm_fill_super)
+        *    sb = sget_fc(fc, test, set_anon_super_fc)
+        *    if (!sb->s_root) {
+        *              err = bm_fill_super(sb, fc);
+        *
+        * => we should never get here with initialized ve->binfmt_misc.
+        */
+       if (WARN_ON_ONCE(ve->binfmt_misc))
+               return -EEXIST;
+
+       bm_data = kzalloc(sizeof(struct binfmt_misc), GFP_KERNEL);
+       if (!bm_data)
+               return -ENOMEM;
+
+       INIT_LIST_HEAD(&bm_data->entries);
+       rwlock_init(&bm_data->entries_lock);
+
        err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
-       if (!err) {
-               sb->s_op = &s_ops;
-               sb->s_fs_info = &binfmt_data;
+       if (err) {
+               kfree(bm_data);
+               return err;
        }
-       return err;
+
+       sb->s_op = &s_ops;
+
+       ve->binfmt_misc = bm_data;
+       bm_data->enabled = 1;
+
+       return 0;
 }
 
 static int bm_get_tree(struct fs_context *fc)
 {
-       return get_tree_single(fc, bm_fill_super);
+       struct ve_struct *ve = get_exec_env();
+
+       /*
+        * We need one binfmt_misc superblock per VE,
+        * use get_tree_keyed() helper to get vfs_tree.
+        *
+        * It allows us to find sb by key (in our case ve is the key),
+        * and if it doesn't exist, creates a new one.
+        *
+        * Important: we take ve refcnt here. It will be put
+        * in one of two places:
+        * 1. bm_free_fc()
+        * on error path (wrong mnt opt provided for instance)
+        * if sb exists and initialized already
+        * 2. bm_kill_sb() when sb refcnt becomes zero (last mount umounted)
+        */
+       return get_tree_keyed(fc, bm_fill_super, get_ve(ve));
+}
+
+static void bm_free_fc(struct fs_context *fc)
+{
+       /*
+        * fc->s_fs_info will be NULL if bm_fill_super() was called and
+        * no error occurred (it means that new sb was allocated successfully)
+        * see fs/super.c sget_fc() helper
+        */
+       if (fc->s_fs_info)
+               put_ve(fc->s_fs_info);
 }
 
+
 static const struct fs_context_operations bm_context_ops = {
+       .free           = bm_free_fc,
        .get_tree       = bm_get_tree,
 };
 
@@ -822,6 +887,14 @@ static int bm_init_fs_context(struct fs_context *fc)
        return 0;
 }
 
+static void bm_kill_sb(struct super_block *sb)
+{
+       struct ve_struct *ve = sb->s_fs_info;
+
+       kill_litter_super(sb);
+       put_ve(ve);
+}
+
 static struct linux_binfmt misc_format = {
        .module = THIS_MODULE,
        .load_binary = load_misc_binary,
@@ -831,7 +904,8 @@ static struct file_system_type bm_fs_type = {
        .owner          = THIS_MODULE,
        .name           = "binfmt_misc",
        .init_fs_context = bm_init_fs_context,
-       .kill_sb        = kill_litter_super,
+       .kill_sb        = bm_kill_sb,
+       .fs_flags       = FS_VIRTUALIZED | FS_VE_MOUNT,
 };
 MODULE_ALIAS_FS("binfmt_misc");
 
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 3665e8c6a853..bb4246147c0d 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -56,6 +56,10 @@ struct ve_struct {
 
        struct kstat_lat_pcpu_struct    sched_lat_ve;
 
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+       struct binfmt_misc      *binfmt_misc;
+#endif
+
        struct kmapset_key      sysfs_perms_key;
 
        atomic_t                netns_avail_nr;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 369df91efda4..a1888d6717f6 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -725,6 +725,9 @@ static void ve_destroy(struct cgroup_subsys_state *css)
        kmapset_unlink(&ve->sysfs_perms_key, &sysfs_ve_perms_set);
        ve_log_destroy(ve);
        ve_free_vdso(ve);
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+       kfree(ve->binfmt_misc);
+#endif
        free_percpu(ve->sched_lat_ve.cur);
        kmem_cache_free(ve_cachep, ve);
 }
_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
