The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit 2731a47d966376ee02b038a3356ad3ec38243c8e
Author: Konstantin Khorenko <khore...@virtuozzo.com>
Date:   Mon Oct 4 20:39:08 2021 +0300

    ve/vfs: introduce "fs.odirect_enable" sysctl and disable it by default

    We've observed a situation where, with many Containers on a node, even
    small direct disk I/O in each CT brings the whole node to its knees
    (100 CTs, 5 lines of logs written every 20-30 seconds).
    The node admittedly had slow HDDs.

    Note that disabling O_DIRECT significantly slows down async reads: they
    can be asynchronous only when direct; if they are issued in cached mode,
    they effectively become synchronous when there is more than one writer.

    Example:

      # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
            --name=test --filename=test --bs=4k --iodepth=64 --size=1G \
            --readwrite=randrw --rwmixread=75

    The VPS here achieved 20MB/s read and 6.8MB/s write,
    while another VPS (with O_DIRECT enabled) achieved 230MB/s read and
    76MB/s write.

    The root cause is known: libaio becomes synchronous with cached I/O.

    So userspace had better check whether the underlying disk is fast enough
    and enable O_DIRECT in those cases.

    https://jira.sw.ru/browse/PSBM-53458
    https://jira.sw.ru/browse/PSBM-68005
    https://jira.sw.ru/browse/PSBM-68656
    https://jira.sw.ru/browse/PSBM-100671
    https://jira.sw.ru/browse/PSBM-104338

    Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

    ===============================================================
    ===============================================================
    Original commit message:

    commit f5829bccbd390437013bd914d68caabf79d09b3e
    Author: Konstantin Khorenko <khore...@virtuozzo.com>
    Date:   Mon Dec 11 23:00:45 2017 +0300

    ve/fs: introduce "fs.fsync-enable" and "fs.odirect_enable" sysctls

    ve/vfs: introduce "odirect_enable" sysctl and disable it by default

    khorenko@: we want to disable direct access from inside a Container
    because there is a limited number of direct requests available on the
    system (128), and when they are all busy the next request is served only
    after some request completes. There is no scheduler at this level =>
    DDoS is possible from inside a CT: just run _many_ processes writing
    with O_DIRECT.

    diff-vfs-odirect-enable && diff-vfs-odirect-enable-location-fix

    Signed-off-by: Kirill Tkhai <ktk...@parallels.com>

    +++
    ve/fs: Port fs.fsync-enable and fs.odirect_enable sysctls

    This is a part of 74-diff-ve-mix-combined.

    https://jira.sw.ru/browse/PSBM-17903

    Signed-off-by: Kirill Tkhai <ktk...@parallels.com>

    =====================================================
    ve/fs: check container odirect and fsync settings in __dentry_open

    sys_open for conventional filesystems doesn't call dentry_open, it calls
    __dentry_open (in nameidata_to_filp), so we have to move checks for
    odirect and fsync behaviour to __dentry_open to make them working on
    ploop containers.

    https://jira.sw.ru/browse/PSBM-17157

    Signed-off-by: Dmitry Guryanov <dgurya...@parallels.com>
    Acked-by: Dmitry Monakhov <dmonak...@openvz.org>
    Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>

    ================================================
    ve: initialize fsync_enable also for non ve0 environment

    Patchset description:
    ve: fix initialization and remove sysctl_fsync_enable

    v2:
     - initialize only on ve cgroup creation, remove get_ve_features
     - rename setup_iptables_mask into ve_setup_iptables_mask

    https://jira.sw.ru/browse/PSBM-34286
    https://jira.sw.ru/browse/PSBM-34285

    Pavel Tikhomirov (4):
      ve: remove sysctl_fsync_enable and use ve_fsync_behavior instead
      ve: initialize fsync_enable also for non ve0 environment
      ve: iptables: fix mask initialization and changing
      ve: cgroup: initialize odirect_enable, features and _randomize_va_space

    =====================================================================
    This patch description:

    v2: only on ve cgroup creation

    https://jira.sw.ru/browse/PSBM-34286

    Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
    Acked-by: Dmitry Monakhov <dmonak...@openvz.org>

    (cherry picked from vz8 commit 166db3147c1b29b4247e50eeae6c18f4ca88c162)
    Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>
---
 fs/fcntl.c         | 30 ++++++++++++++++++++++++++++++
 fs/open.c          |  3 +++
 include/linux/fs.h |  2 ++
 include/linux/ve.h |  1 +
 kernel/sysctl.c    |  7 +++++++
 kernel/ve/ve.c     |  2 ++
 6 files changed, 45 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 714e7c9a5fc4..2e0c8515bd1a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -26,6 +26,7 @@
 #include <linux/memfd.h>
 #include <linux/compat.h>
 #include <linux/mount.h>
+#include <linux/ve.h>
 
 #include <linux/poll.h>
 #include <asm/siginfo.h>
@@ -33,11 +34,40 @@
 
 #define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
 
+/*
+ * Host is always allowed to use O_DIRECT.
+ * Host's value of sysctl "fs.odirect_enable" might affect Containers only.
+ *
+ * Container's "fs.odirect_enable" sysctl value means:
+ * 0: Container ignores O_DIRECT flag
+ * 1: Container honors O_DIRECT flag (in fact, any X>0 && X != 2)
+ * 2: Container checks the host's sysctl value and work according it
+ */
+int may_use_odirect(void)
+{
+	int may;
+
+	if (ve_is_super(get_exec_env()))
+		return 1;
+
+	may = capable(CAP_SYS_RAWIO);
+	if (!may) {
+		may = get_exec_env()->odirect_enable;
+		if (may == 2)
+			may = get_ve0()->odirect_enable;
+	}
+
+	return may;
+}
+
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
 	struct inode * inode = file_inode(filp);
 	int error = 0;
 
+	if (!may_use_odirect())
+		arg &= ~O_DIRECT;
+
 	/*
 	 * O_APPEND cannot be cleared if the file is marked as append-only
 	 * and the file is open for write.
diff --git a/fs/open.c b/fs/open.c
index 21c941193783..040df8bc6e76 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -782,6 +782,9 @@ static int do_dentry_open(struct file *f,
 	f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
 	f->f_sb_err = file_sample_sb_err(f);
 
+	if (!may_use_odirect())
+		f->f_flags &= ~O_DIRECT;
+
 	if (unlikely(f->f_flags & O_PATH)) {
 		f->f_mode = FMODE_PATH | FMODE_OPENED;
 		f->f_op = &empty_fops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f4f31c29259f..8d772f6822bb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -186,6 +186,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports async buffered reads */
 #define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
 
+extern int may_use_odirect(void);
+
 /*
  * Attribute flags. These should be or-ed together to figure out what
  * has been changed!
diff --git a/include/linux/ve.h b/include/linux/ve.h
index bb4246147c0d..4ecfba601f45 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -55,6 +55,7 @@ struct ve_struct {
 #define VE_LOG_BUF_LEN		4096
 
 	struct kstat_lat_pcpu_struct	sched_lat_ve;
+	int			odirect_enable;
 
 #if IS_ENABLED(CONFIG_BINFMT_MISC)
 	struct binfmt_misc	*binfmt_misc;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9321aa702a25..55054f136f68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3443,6 +3443,13 @@ static struct ctl_table fs_table[] = {
 		.child		= sysctl_mount_point,
 	},
 #endif
+	{
+		.procname	= "odirect_enable",
+		.data		= &ve0.odirect_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644 | S_ISVTX,
+		.proc_handler	= proc_dointvec_virtual,
+	},
 	{
 		.procname	= "pipe-max-size",
 		.data		= &pipe_max_size,
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index a1888d6717f6..5822f74ea2a2 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -629,6 +629,8 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_
 
 	ve->meminfo_val = VE_MEMINFO_DEFAULT;
 
+	ve->odirect_enable = 2;
+
 	atomic_set(&ve->netns_avail_nr, NETNS_MAX_NR_DEFAULT);
 	ve->netns_max_nr = NETNS_MAX_NR_DEFAULT;
 
_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
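
Illustration (not part of the patch): the decision order in may_use_odirect() can be easier to follow as a small userspace model. The sketch below is hypothetical throughout: struct odirect_ctx, MODEL_O_DIRECT and the model_* functions are names invented for this note, and the booleans only stand in for ve_is_super(), capable(CAP_SYS_RAWIO) and the two odirect_enable fields. It also assumes the host value (ve0.odirect_enable) starts at 0, which the subject line implies but this diff does not show explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the real O_DIRECT bit; the value is illustrative. */
#define MODEL_O_DIRECT 0x4000

/* Hypothetical userspace model of the inputs may_use_odirect() reads. */
struct odirect_ctx {
	bool is_host;     /* models ve_is_super(get_exec_env()) */
	bool has_rawio;   /* models capable(CAP_SYS_RAWIO) */
	int ct_sysctl;    /* models get_exec_env()->odirect_enable */
	int host_sysctl;  /* models get_ve0()->odirect_enable */
};

/* Follows the same decision order as may_use_odirect() in the patch. */
int model_may_use_odirect(const struct odirect_ctx *c)
{
	int may;

	if (c->is_host)
		return 1;                 /* host is always allowed */

	may = c->has_rawio;               /* the capability bypasses the sysctl */
	if (!may) {
		may = c->ct_sysctl;
		if (may == 2)             /* 2: defer to the host's value */
			may = c->host_sysctl;
	}
	return may;
}

/* Models what setfl() and do_dentry_open() do with the flag. */
int model_effective_flags(const struct odirect_ctx *c, int flags)
{
	if (!model_may_use_odirect(c))
		flags &= ~MODEL_O_DIRECT; /* dropped silently, no error returned */
	return flags;
}

/* Walks through the cases listed in the patch comment. */
void odirect_model_selfcheck(void)
{
	/* Freshly created CT: value 2, host value assumed to be 0. */
	struct odirect_ctx ct = { false, false, 2, 0 };

	assert(model_effective_flags(&ct, MODEL_O_DIRECT) == 0);

	ct.host_sysctl = 1;               /* host opts in for deferring CTs */
	assert(model_effective_flags(&ct, MODEL_O_DIRECT) == MODEL_O_DIRECT);

	ct.ct_sysctl = 0;                 /* CT setting 0 overrides the host */
	assert(model_effective_flags(&ct, MODEL_O_DIRECT) == 0);

	ct.has_rawio = true;              /* CAP_SYS_RAWIO wins over the sysctl */
	assert(model_effective_flags(&ct, MODEL_O_DIRECT) == MODEL_O_DIRECT);
}
```

One consequence the model makes visible: when the flag is not allowed it is masked out rather than rejected, so an application inside a CT gets no error and would have to read the flags back (for example with fcntl(F_GETFL)) to notice that it is running in cached mode.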