From: Konstantin Khorenko <khore...@virtuozzo.com>

"sync/fsync" called from inside a Container might have different behavior.
Affects sys_sync, sys_fsync, sys_fdatasync, sys_sync_file_range syscalls.
aio_fsync (sys_io_submit) is not affected.

Syncs cannot be disabled for ve0. All values described below (even if set
on ve0) affect veX behavior only.

Possible values for the Hardware Node:
======================================
0 (FSYNC_NEVER)     CT fsync and syncs are ignored
1 (FSYNC_ALWAYS)    CT fsync and syncs work as usual, all inodes for all
                    filesystems will be synced
2 (FSYNC_FILTERED)  CT fsync works as usual, sync flushes only the CT's own
                    file data (only CT-related files and filesystems will
                    be flushed)

Possible values inside a Container:
======================================
0                   CT fsync and syncs are ignored
2                   Use HN global value
any other value     Filtered sync, same as HN value 2 (FSYNC_FILTERED)

Default kernel value (for both HN and CT): 2 (FSYNC_FILTERED).

=====================================================
ve/fs: Port fs.fsync-enable and fs.odirect_enable sysctls

This is a part of 74-diff-ve-mix-combined.

https://jira.sw.ru/browse/PSBM-17903

Signed-off-by: Kirill Tkhai <ktk...@parallels.com>

=====================================================
ve/fs: check container odirect and fsync settings in __dentry_open

sys_open for conventional filesystems doesn't call dentry_open, it calls
__dentry_open (in nameidata_to_filp), so we have to move the checks for
odirect and fsync behaviour to __dentry_open to make them work on ploop
containers.

https://jira.sw.ru/browse/PSBM-17157

Signed-off-by: Dmitry Guryanov <dgurya...@parallels.com>
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>

================================================
ve: initialize fsync_enable also for non ve0 environment

Patchset description:
ve: fix initialization and remove sysctl_fsync_enable

v2:
- initialize only on ve cgroup creation, remove get_ve_features
- rename setup_iptables_mask into ve_setup_iptables_mask

https://jira.sw.ru/browse/PSBM-34286
https://jira.sw.ru/browse/PSBM-34285

Pavel Tikhomirov (4):
  ve: remove sysctl_fsync_enable and use ve_fsync_behavior instead
  ve: initialize fsync_enable also for non ve0 environment
  ve: iptables: fix mask initialization and changing
  ve: cgroup: initialize odirect_enable, features and _randomize_va_space

=====================================================================
Combined several vz7 patches into one:

d35caf1 ("ve/fs/sync: per containter sync and syncfs")
3016bac ("ve: remove sync_mutex")
4cc281e ("ve: remove sysctl_fsync_enable and use ve_fsync_behavior instead")
c3e4103 ("ve/fs: introduce "fs.fsync-enable" and "fs.odirect_enable" sysctls")
fdbb570 ("fs: Restrict ve sync methods")

VZ 8 rebase part https://jira.sw.ru/browse/PSBM-127782

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com>

khorenko@ changes:
- "2" -> "FSYNC_FILTERED" in a couple of places
-
-	if (!sb_rdonly(sb) && sb->s_root && sb->s_bdi)
+	if (!sb_rdonly(sb) && sb->s_root && (sb->s_flags & SB_BORN))

+++
ve/msync: fix wrong behaviour of fs.fsync-enable

When FSYNC_NEVER is set in container (in fs.fsync-enable sysctl) syncs
should be ignored instead of failing with ENOMEM as we have now.

https://jira.sw.ru/browse/PSBM-131652

Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
Acked-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com>

+++
ve/sync/mounts: skip cursor mounts when iterating over mnt_ns->list

After RHEL ported "proc/mounts: add cursor" we need to iterate over the
mounts list in mntns more carefully:

- Export mnt_list_next and move it out from CONFIG_PROC_FS;
- Use mnt_list_next in sync_collect_filesystems to skip cursors.

Otherwise the kernel would break at dereferencing something from an
uninitialized cursor mount.

https://jira.sw.ru/browse/PSBM-131158

Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>

+++
fs/sync: fix nullptr dereference ve->ve_ns->mnt_ns

ve_ns is not guaranteed to be non-NULL.
Fix is_sb_ve_accessible() and sync_collect_filesystems().
Also add rcu_dereference since ve->ve_ns is rcu-protected.

An example of shell commands to crash kernel:

  # mkdir /sys/fs/cgroup/ve/10001
  # echo 10001 > /sys/fs/cgroup/ve/10001/ve.veid
  # echo $$ > /sys/fs/cgroup/ve/10001/tasks
  # sync

[59390.889322] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[59390.889395] PGD 0 P4D 0
[59390.889442] Oops: 0000 [#1] SMP PTI
[59390.889492] CPU: 1 PID: 8950 Comm: sync ve: 10001 Kdump: loaded Not tainted 4.18.0-240.1.1.vz8.5.47 #1 5.47
[59390.889554] Hardware name: Virtuozzo KVM, BIOS 1.10.2-3.1.vz7.3 04/01/2014
[59390.889622] RIP: 0010:sync_filesystems_ve+0x34/0x220
[59390.889673] Code: 55 41 54 55 53 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 48 8b 87 98 01 00 00 48 8d 6c 24 08 48 89 6c 24 08 <4c> 8b 68 18 48 8b 44 24 08 48 89 6c 24 10 48 39 c5 0f 85 ce 01 00
[59390.889798] RSP: 0018:ffffb1b7810a7ec0 EFLAGS: 00010246
[59390.889849] RAX: 0000000000000000 RBX: ffff92309ab7c418 RCX: 0000000000000000
[59390.889903] RDX: ffff92308bbff180 RSI: 0000000000000000 RDI: ffff92309ab7c418
[59390.889958] RBP: ffffb1b7810a7ec8 R08: 0000000000000000 R09: 0000000000000000
[59390.890016] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[59390.890071] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[59390.890126] FS: 00007fd7880b6540(0000) GS:ffff9230bbb00000(0000) knlGS:0000000000000000
[59390.890184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[59390.890235] CR2: 0000000000000018 CR3: 000000010b22e000 CR4: 00000000000006e0
[59390.890293] Call Trace:
[59390.890351] ? __do_page_fault+0x23a/0x4f0
[59390.890407] ksys_sync+0x10d/0x130
[59390.890456] __ia32_sys_sync+0xa/0x10
[59390.890509] do_syscall_64+0x5b/0x1a0
[59390.890562] entry_SYSCALL_64_after_hwframe+0x65/0xca
[59390.890620] RIP: 0033:0x7fd787fe4ffb
[59390.890667] Code: c3 48 8b 0d a7 8e 0c 00 f7 d8 64 89 01 b8 ff ff ff ff eb c2 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 8e 0c 00 f7 d8 64 89 01 48
[59390.890791] RSP: 002b:00007ffd853dd328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[59390.890848] RAX: ffffffffffffffda RBX: 00007ffd853dd468 RCX: 00007fd787fe4ffb
[59390.890903] RDX: 00007fd7880b2001 RSI: 0000000000000000 RDI: 00007fd788079b5e
[59390.890957] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[59390.891012] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[59390.891067] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fd7880ae1b4
[59390.896038] CR2: 0000000000000018

https://jira.sw.ru/browse/PSBM-130894

Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>

v2: move new sync_filesystems code under namespace_sem to ensure mnt_ns
won't disappear unexpectedly

(cherry picked from vz8 commit 5a96860dcd780c5caaaaf7c95cbefc764cd7f88a)
Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>
---
 fs/fcntl.c          |   2 +
 fs/mount.h          |   2 +
 fs/namespace.c      |   8 +-
 fs/open.c           |   3 +
 fs/sync.c           | 213 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h  |  12 +++
 include/linux/ve.h  |   2 +
 kernel/ve/ve.c      |   3 +
 kernel/ve/veowner.c |   8 ++
 mm/msync.c          |   2 +
 10 files changed, 249 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e0c851..8af146e 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -68,6 +68,8 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 
 	if (!may_use_odirect())
 		arg &= ~O_DIRECT;
+	if (ve_fsync_behavior() == FSYNC_NEVER)
+		arg &= ~O_SYNC;
 	/*
 	 * O_APPEND cannot be cleared if the file is marked as append-only
 	 * and the file is open for write.
diff --git a/fs/mount.h b/fs/mount.h
index e19f732..7c6b724 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -100,6 +100,8 @@ static inline int is_mounted(struct vfsmount *mnt)
 	return !IS_ERR_OR_NULL(real_mount(mnt)->mnt_ns);
 }
 
+extern struct rw_semaphore namespace_sem;
+
 extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
 
 extern int __legitimize_mnt(struct vfsmount *, unsigned);
diff --git a/fs/namespace.c b/fs/namespace.c
index c106149..7af19eb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -72,7 +72,7 @@ static int __init set_mphash_entries(char *str)
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
-static DECLARE_RWSEM(namespace_sem);
+DECLARE_RWSEM(namespace_sem);
 static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
 static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
 
@@ -1300,9 +1300,8 @@ struct vfsmount *mnt_clone_internal(const struct path *path)
 	return &p->mnt;
 }
 
-#ifdef CONFIG_PROC_FS
-static struct mount *mnt_list_next(struct mnt_namespace *ns,
-				   struct list_head *p)
+struct mount *mnt_list_next(struct mnt_namespace *ns,
+			    struct list_head *p)
 {
 	struct mount *mnt, *ret = NULL;
 
@@ -1319,6 +1318,7 @@ static struct mount *mnt_list_next(struct mnt_namespace *ns,
 	return ret;
 }
 
+#ifdef CONFIG_PROC_FS
 /* iterator; we want it to have access to namespace_sem, thus here... */
 static void *m_start(struct seq_file *m, loff_t *pos)
 {
diff --git a/fs/open.c b/fs/open.c
index 040df8b..65e60aa 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -785,6 +785,9 @@ static int do_dentry_open(struct file *f,
 	if (!may_use_odirect())
 		f->f_flags &= ~O_DIRECT;
 
+	if (ve_fsync_behavior() == FSYNC_NEVER)
+		f->f_flags &= ~O_SYNC;
+
 	if (unlikely(f->f_flags & O_PATH)) {
 		f->f_mode = FMODE_PATH | FMODE_OPENED;
 		f->f_op = &empty_fops;
diff --git a/fs/sync.c b/fs/sync.c
index 1373a61..1c78756 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -8,6 +8,7 @@
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/sched.h>
 #include <linux/writeback.h>
@@ -16,7 +17,9 @@
 #include <linux/pagemap.h>
 #include <linux/quotaops.h>
 #include <linux/backing-dev.h>
+#include <linux/ve.h>
 #include "internal.h"
+#include "mount.h"
 
 #define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
 			SYNC_FILE_RANGE_WAIT_AFTER)
@@ -96,6 +99,160 @@ static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
 	filemap_fdatawait_keep_errors(bdev->bd_inode->i_mapping);
 }
 
+struct sync_sb {
+	struct list_head list;
+	struct super_block *sb;
+};
+
+static void sync_release_filesystems(struct list_head *sync_list)
+{
+	struct sync_sb *ss, *tmp;
+
+	list_for_each_entry_safe(ss, tmp, sync_list, list) {
+		list_del(&ss->list);
+		put_super(ss->sb);
+		kfree(ss);
+	}
+}
+
+static int sync_filesystem_collected(struct list_head *sync_list, struct super_block *sb)
+{
+	struct sync_sb *ss;
+
+	list_for_each_entry(ss, sync_list, list)
+		if (ss->sb == sb)
+			return 1;
+	return 0;
+}
+
+static int sync_collect_filesystems(struct ve_struct *ve, struct list_head *sync_list)
+{
+	struct mount *mnt;
+	struct mnt_namespace *mnt_ns;
+	struct nsproxy *ve_ns;
+	struct sync_sb *ss;
+	int ret = 0;
+
+	BUG_ON(!list_empty(sync_list));
+
+	down_read(&namespace_sem);
+
+	rcu_read_lock();
+	ve_ns = rcu_dereference(ve->ve_ns);
+	if (!ve_ns) {
+		rcu_read_unlock();
+		up_read(&namespace_sem);
+		return 0;
+	}
+	mnt_ns = ve_ns->mnt_ns;
+	rcu_read_unlock();
+
+	mnt = mnt_list_next(mnt_ns, &mnt_ns->list);
+	while (mnt) {
+		if (sync_filesystem_collected(sync_list, mnt->mnt.mnt_sb))
+			goto next;
+
+		ss = kmalloc(sizeof(*ss), GFP_KERNEL);
+		if (ss == NULL) {
+			ret = -ENOMEM;
+			break;
+		}
+		ss->sb = mnt->mnt.mnt_sb;
+		/*
+		 * We hold the mount point and thus can be sure that the superblock
+		 * is alive. This means that we can safely increase its usage
+		 * counter.
+		 */
+		spin_lock(&sb_lock);
+		ss->sb->s_count++;
+		spin_unlock(&sb_lock);
+		list_add_tail(&ss->list, sync_list);
+next:
+		mnt = mnt_list_next(mnt_ns, &mnt->mnt_list);
+	}
+	up_read(&namespace_sem);
+	return ret;
+}
+
+static void sync_filesystems_ve(struct ve_struct *ve, int wait)
+{
+	struct super_block *sb;
+	LIST_HEAD(sync_list);
+	struct sync_sb *ss;
+
+	/*
+	 * We don't need to care about allocation failure here. At least we
+	 * don't need to skip sync on such error.
+	 * Let's sync what we collected already instead.
+	 */
+	sync_collect_filesystems(ve, &sync_list);
+
+	list_for_each_entry(ss, &sync_list, list) {
+		sb = ss->sb;
+		down_read(&sb->s_umount);
+		if (!sb_rdonly(sb) && sb->s_root && (sb->s_flags & SB_BORN))
+			__sync_filesystem(sb, wait);
+		up_read(&sb->s_umount);
+	}
+
+	sync_release_filesystems(&sync_list);
+}
+
+static int is_sb_ve_accessible(struct ve_struct *ve, struct super_block *sb)
+{
+	struct mount *mnt;
+	struct mnt_namespace *mnt_ns;
+	struct nsproxy *ve_ns;
+	int ret = 0;
+
+	down_read(&namespace_sem);
+
+	rcu_read_lock();
+	ve_ns = rcu_dereference(ve->ve_ns);
+	if (!ve_ns) {
+		rcu_read_unlock();
+		up_read(&namespace_sem);
+		return 0;
+	}
+	mnt_ns = ve_ns->mnt_ns;
+	rcu_read_unlock();
+
+	list_for_each_entry(mnt, &mnt_ns->list, mnt_list) {
+		if (mnt->mnt.mnt_sb == sb) {
+			ret = 1;
+			break;
+		}
+	}
+	up_read(&namespace_sem);
+	return ret;
+}
+
+static int __ve_fsync_behavior(struct ve_struct *ve)
+{
+	/*
+	 * - __ve_fsync_behavior() is not called for ve0
+	 * - FSYNC_FILTERED for veX does NOT mean "filtered" behavior
+	 * - FSYNC_FILTERED for veX means "get value from ve0"
+	 */
+	if (ve->fsync_enable == FSYNC_FILTERED)
+		return get_ve0()->fsync_enable;
+	else if (ve->fsync_enable)
+		return FSYNC_FILTERED; /* sync forced by ve is always filtered */
+	else
+		return 0;
+}
+
+int ve_fsync_behavior(void)
+{
+	struct ve_struct *ve;
+
+	ve = get_exec_env();
+	if (ve_is_super(ve))
+		return FSYNC_ALWAYS;
+	else
+		return __ve_fsync_behavior(ve);
+}
+
 /*
  * Sync everything. We start by waking flusher threads so that most of
  * writeback runs on all devices in parallel. Then we sync all inodes reliably
@@ -108,8 +265,32 @@ static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
  */
 void ksys_sync(void)
 {
+	struct ve_struct *ve = get_exec_env();
 	int nowait = 0, wait = 1;
 
+	if (!ve_is_super(ve)) {
+		int fsb;
+		/*
+		 * init can't sync during VE stop. Rationale:
+		 *  - NFS with -o hard will block forever as network is down
+		 *  - no useful job is performed as VE0 will call umount/sync
+		 *    on its own later
+		 *  Den
+		 */
+		if (is_child_reaper(task_pid(current)))
+			return;
+
+		fsb = __ve_fsync_behavior(ve);
+		if (fsb == FSYNC_NEVER)
+			return;
+
+		if (fsb == FSYNC_FILTERED) {
+			sync_filesystems_ve(ve, nowait);
+			sync_filesystems_ve(ve, wait);
+			return;
+		}
+	}
+
 	wakeup_flusher_threads(WB_REASON_SYNC);
 	iterate_supers(sync_inodes_one_sb, NULL);
 	iterate_supers(sync_fs_one_sb, &nowait);
@@ -162,18 +343,42 @@ void emergency_sync(void)
 {
 	struct fd f = fdget(fd);
 	struct super_block *sb;
-	int ret, ret2;
+	int ret = 0, ret2 = 0;
+	struct ve_struct *ve;
 
 	if (!f.file)
 		return -EBADF;
 	sb = f.file->f_path.dentry->d_sb;
 
+	ve = get_exec_env();
+
+	if (!ve_is_super(ve)) {
+		int fsb;
+		/*
+		 * init can't sync during VE stop. Rationale:
+		 *  - NFS with -o hard will block forever as network is down
+		 *  - no useful job is performed as VE0 will call umount/sync
+		 *    on its own later
+		 *  Den
+		 */
+		if (is_child_reaper(task_pid(current)))
+			goto fdput;
+
+		fsb = __ve_fsync_behavior(ve);
+		if (fsb == FSYNC_NEVER)
+			goto fdput;
+
+		if ((fsb == FSYNC_FILTERED) && !is_sb_ve_accessible(ve, sb))
+			goto fdput;
+	}
+
 	down_read(&sb->s_umount);
 	ret = sync_filesystem(sb);
 	up_read(&sb->s_umount);
 
 	ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
 
+fdput:
 	fdput(f);
 	return ret ? ret : ret2;
 }
@@ -217,9 +422,13 @@ int vfs_fsync(struct file *file, int datasync)
 
 static int do_fsync(unsigned int fd, int datasync)
 {
-	struct fd f = fdget(fd);
+	struct fd f;
 	int ret = -EBADF;
 
+	if (ve_fsync_behavior() == FSYNC_NEVER)
+		return 0;
+
+	f = fdget(fd);
 	if (f.file) {
 		ret = vfs_fsync(f.file, datasync);
 		fdput(f);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 01419db..42021f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -71,6 +71,7 @@ struct fs_context;
 struct fs_parameter_spec;
 struct fileattr;
+struct mnt_namespace;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2521,6 +2522,7 @@ extern struct dentry *mount_nodev(struct file_system_type *fs_type,
 void kill_litter_super(struct super_block *sb);
 void deactivate_super(struct super_block *sb);
 void deactivate_locked_super(struct super_block *sb);
+void put_super(struct super_block *sb);
 int set_anon_super(struct super_block *s, void *data);
 int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
 int get_anon_bdev(dev_t *);
@@ -3146,6 +3148,13 @@ static inline void i_readcount_inc(struct inode *inode)
 
 extern char *file_path(struct file *, char *, int);
 
+int ve_fsync_behavior(void);
+
+#define FSYNC_NEVER	0	/* ve syncs are ignored    */
+#define FSYNC_ALWAYS	1	/* ve syncs work as usual  */
+#define FSYNC_FILTERED	2	/* ve syncs only its files */
+/* For non-ve0 FSYNC_FILTERED value means "get value from ve0". */
+
+
 #include <linux/err.h>
 
 /* needed for stackable file system support */
@@ -3495,6 +3504,9 @@ void setattr_copy(struct user_namespace *, struct inode *inode,
 
 extern int file_update_time(struct file *file);
 
+extern struct mount *mnt_list_next(struct mnt_namespace *ns,
+				   struct list_head *p);
+
 static inline bool vma_is_dax(const struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 3d5a1dc..ad1c4710 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -57,6 +57,8 @@ struct ve_struct {
 	struct kstat_lat_pcpu_struct sched_lat_ve;
 
 	int odirect_enable;
+	int fsync_enable;
+
 #if IS_ENABLED(CONFIG_BINFMT_MISC)
 	struct binfmt_misc *binfmt_misc;
 #endif
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index e8616d9..fe4c4d9 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -56,6 +56,7 @@ struct ve_struct ve0 = {
 	.sched_lat_ve.cur = &ve0_lat_stats,
 	.netns_avail_nr = ATOMIC_INIT(INT_MAX),
 	.netns_max_nr = INT_MAX,
+	.fsync_enable = FSYNC_FILTERED,
 	._randomize_va_space =
 #ifdef CONFIG_COMPAT_BRK
 					1,
@@ -678,6 +679,8 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_
 	ve->meminfo_val = VE_MEMINFO_DEFAULT;
 
 	ve->odirect_enable = 2;
+	/* for veX FSYNC_FILTERED means "get value from ve0" */
+	ve->fsync_enable = FSYNC_FILTERED;
 
 	atomic_set(&ve->netns_avail_nr, NETNS_MAX_NR_DEFAULT);
 	ve->netns_max_nr = NETNS_MAX_NR_DEFAULT;
diff --git a/kernel/ve/veowner.c b/kernel/ve/veowner.c
index b0aba35..e255fe5 100644
--- a/kernel/ve/veowner.c
+++ b/kernel/ve/veowner.c
@@ -7,6 +7,7 @@
  *
  */
 
+#include <linux/ve.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/proc_fs.h>
@@ -66,6 +67,13 @@ static void prepare_proc(void)
 		.extra1 = &ve_mount_nr_min,
 		.extra2 = &ve_mount_nr_max,
 	},
+	{
+		.procname = "fsync-enable",
+		.data = &ve0.fsync_enable,
+		.maxlen = sizeof(int),
+		.mode = 0644 | S_ISVTX,
+		.proc_handler = &proc_dointvec_virtual,
+	},
 	{ }
 };
 
diff --git a/mm/msync.c b/mm/msync.c
index 137d1c1..20737eb 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -51,6 +51,8 @@
 	if (end < start)
 		goto out;
 	error = 0;
+	if (ve_fsync_behavior() == FSYNC_NEVER)
+		goto out;
 	if (end == start)
 		goto out;
 	/*
-- 
1.8.3.1
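
For reference, the value resolution described in the changelog (HN value, CT value, and the special meaning of 2 inside a CT) can be modeled in userspace roughly as below. This is only an illustrative sketch and not part of the patch: struct ve_model and model_fsync_behavior() are hypothetical names that stand in for struct ve_struct and ve_fsync_behavior(), and the is_super flag stands in for ve_is_super().

```c
#include <assert.h>
#include <stddef.h>

/* Same numeric values as the FSYNC_* defines added to include/linux/fs.h. */
enum { FSYNC_NEVER = 0, FSYNC_ALWAYS = 1, FSYNC_FILTERED = 2 };

/* Hypothetical userspace stand-in for struct ve_struct. */
struct ve_model {
	int is_super;		/* non-zero for ve0 (the Hardware Node) */
	int fsync_enable;	/* value of the fs.fsync-enable sysctl */
};

/*
 * Model of the effective behavior for a task running in 've', given the
 * Hardware Node settings in 've0'. Mirrors the branches of
 * ve_fsync_behavior() and __ve_fsync_behavior() from the patch.
 */
static int model_fsync_behavior(const struct ve_model *ve,
				const struct ve_model *ve0)
{
	assert(ve != NULL && ve0 != NULL);

	/* Syncs cannot be disabled for ve0. */
	if (ve->is_super)
		return FSYNC_ALWAYS;

	/* Inside a CT, 2 means "use the HN global value". */
	if (ve->fsync_enable == FSYNC_FILTERED)
		return ve0->fsync_enable;

	/* Any other non-zero CT value forces filtered sync. */
	if (ve->fsync_enable)
		return FSYNC_FILTERED;

	return FSYNC_NEVER;
}
```

With the default settings (2 on both HN and CT) the model yields FSYNC_FILTERED for a Container; setting the HN value to 0 while the CT stays at 2 yields FSYNC_NEVER for that Container.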