[Devel] [PATCH RH9 00/12] part4 (sched part)

2021-09-23 Thread Alexander Mikhalitsyn
https://jira.sw.ru/browse/PSBM-133986 Kirill Tkhai (11): sched: Add ve name to sched_show_task() sched: disable dumping cfs info on sysrq trigger kernel: Account nr_zombie and nr_dead sched: Count rq::nr_sleeping and cfs_rq::nr_unint sched: Account cfs_rq::nr_iowait sched: Add primitiv

[Devel] [PATCH RH9 12/12] sched: Return only virtual cpus in sched_getaffinity()

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Fill the mask using virtual cpus, which are enumerated from 0 to num_online_vcpus()-1. This pairs with /proc/cpuinfo and sched_setaffinity(). https://jira.sw.ru/browse/PSBM-25367 Signed-off-by: Kirill Tkhai Acked-by: Vladimir Davydov An addition by khorenko@: "nproc" utility wo
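The effect of this patch is visible from user space through sched_getaffinity(2). A minimal sketch, assuming glibc: count the CPUs in the calling thread's affinity mask, roughly the way GNU nproc does via gnulib. With the patch, a process inside a container would be expected to see only virtual cpus 0..num_online_vcpus()-1; on a host or a stock kernel the mask reflects the physical CPUs. The function name count_affinity_cpus() is hypothetical.

```c
#define _GNU_SOURCE /* needed for sched_getaffinity() and CPU_COUNT() */
#include <sched.h>
#include <stdio.h>

/* Count the CPUs the calling thread may run on.
 * Returns -1 if sched_getaffinity() fails. */
int count_affinity_cpus(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    /* pid 0 means "the calling thread". */
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return -1;
    }
    /* CPU_COUNT() counts the bits set in the mask. */
    return CPU_COUNT(&set);
}
```

Inside a container on a kernel with this patch, the returned count would be expected to match the number of virtual cpus listed in /proc/cpuinfo.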

[Devel] [PATCH RH9 03/12] kernel: Account nr_zombie and nr_dead

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Extracted from "Initial patch". Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-133986 (cherry picked from commit b097bc8d02100cc6c8fa55433fc7a059b04c37fb) Signed-off-by: Alexander Mikhalitsyn --- include/linux/sched.h | 3 +++ kernel/exit.c | 14 +

[Devel] [PATCH RH9 04/12] sched: Count rq::nr_sleeping and cfs_rq::nr_unint

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Extracted from "Initial patch". Note: it would be better to move nr_unint to struct task_group in the future. Signed-off-by: Kirill Tkhai Rebase to RHEL8.3 kernel-4.18.0-240.1.1.el8_3 notes: khorenko@: I've substituted task_contributes_to_load() with tsk->sched_contrib

[Devel] [PATCH RH9 01/12] sched: Add ve name to sched_show_task()

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Extracted from "ve: initial patch". rcu_read_unlock() move by Andrey Ryabinin. Signed-off-by: OpenVZ Team Signed-off-by: Andrey Ryabinin Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-133986 See also cc172ff301 ("sched/debug: Fix the alignment of the show-sta

[Devel] [PATCH RH9 08/12] sched: Port CONFIG_CFS_CPULIMIT feature

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Add the possibility to limit the cpus used by a cgroup/container. Signed-off-by: Vladimir Davydov Signed-off-by: Kirill Tkhai +++ sched: Allow configuring sched_vcpu_hotslice and sched_cpulimit_scale_cpufreq Let's make our sysctls ported from vz8 really configurable. These are

[Devel] [PATCH RH9 06/12] sched: Add primitives to calculate nr running, sleeping, stopped and uninterruptible tasks

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Extracted from "Initial patch". Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-133986 unsigned long -> unsigned int (cherry picked from commit 153ef777fd0e3be36ed2d4cb28de0449f7e14c41) Signed-off-by: Alexander Mikhalitsyn --- include/linux/sched/stat.h | 2 +

[Devel] [PATCH RH9 11/12] sched: prohibit setting affinity from inside a CT

2021-09-23 Thread Alexander Mikhalitsyn
From: Konstantin Khorenko Signed-off-by: OpenVZ Team Extracted by Konstantin Khorenko Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-133986 (cherry picked from commit b055e02b4378443778b38ef77712b803f9bcb19f) Signed-off-by: Alexander Mikhalitsyn --- kernel/sched/core.c | 3 +++

[Devel] [PATCH RH9 05/12] sched: Account cfs_rq::nr_iowait

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Extracted from "Initial patch". Signed-off-by: Kirill Tkhai +++ sched: fix cfs_rq::nr_iowait accounting After recent RedHat (b6be9ae "rh7: import RHEL7 kernel-3.10.0-957.12.2.el7") following sequence: update_stats_dequeue() dequeue_sleeper() cfs_rq->nr_iowait++

[Devel] [PATCH RH9 10/12] sched: Add cpulimit cgroup interfaces

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Add CONFIG_CPULIMIT cpu cgroup files. Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-133986 See f4183717b ("sched/fair: Introduce the burstable CFS controller") (cherry picked from commit 0e0e0bfbf884f8fb0347e8cb6ed27aa2bf991c91) Signed-off-by: Alexander Mikhal

[Devel] [PATCH RH9 02/12] sched: disable dumping cfs info on sysrq trigger

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai This results in soft lockups, because it writes too much data to the console. At the same time, the information it shows is only useful for sched debugging and can be obtained via /proc/sched_debug anyway. Besides, it is disabled in PCS6. So disable it in vz7 as well. https://jira.sw.ru

[Devel] [PATCH RH9 09/12] sched: Split tg_set_cfs_bandwidth() and export default_cfs_period()

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai This is needed for CONFIG_CFS_CPULIMIT. Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-133986 (cherry picked from commit dcff66d4a02c9fc56536b3e002b4dee3e9efd3fa) Signed-off-by: Alexander Mikhalitsyn --- kernel/sched/core.c | 20 +++- kernel/s

[Devel] [PATCH RH9 07/12] sched: Split task_h_load()

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Refactoring needed for CONFIG_CFS_CPULIMIT. Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-133986 (cherry picked from commit f511f958d4294fe50e1b42661a5bd7012b61f82a) Signed-off-by: Alexander Mikhalitsyn --- kernel/sched/fair.c | 15 ++- 1 file cha

[Devel] [PATCH rh7] ploop: Fix partition refcounter leak

2021-09-23 Thread Konstantin Khorenko
ms commit b7d6c3033323 ("block: fix use-after-free on cached last_lookup partition") has changed the semantic of disk_map_sector_rcu() function: previously it only returned appropriate partition but after the patch it additionally gets a refcounter on that partition. RedHat has backported that pat

[Devel] [PATCH RH9 00/11] part4 (almost all rest)

2021-09-23 Thread Alexander Mikhalitsyn
https://jira.sw.ru/browse/PSBM-133986 Kirill Tkhai (1): ve/net/ip_gre: containerize per-net devices Konstantin Khorenko (4): ve/net/cred: add ve_capable to check capabilities relative to the current VE (v2) ve/net/vxlan: enable support in a container ve/uevent: Use own uevent_seqnum f

[Devel] [PATCH RH9 02/11] ve/kernel: allow to increase rlimit from inside container

2021-09-23 Thread Alexander Mikhalitsyn
From: Vladimir Davydov This works on PCS6, so we should allow it on Vz7 as well. https://jira.sw.ru/browse/PSBM-43410 Signed-off-by: Vladimir Davydov https://jira.sw.ru/browse/PSBM-133986 (cherry picked from commit 23eff01c369131f7e4cab1c37de1c253266d5039) Signed-off-by: Alexander Mikhalitsyn

[Devel] [PATCH RH9 04/11] ve/net/dummy: enable support in a container

2021-09-23 Thread Alexander Mikhalitsyn
From: Vasily Averin This patch enables dummy network devices support in a container. https://jira.sw.ru/browse/PSBM-43329 khorenko@: JFYI: up to now it's required by ltp test only (containers.netns_sysfs testcase). Signed-off-by: Vasily Averin https://jira.sw.ru/browse/PSBM-133986 (cherry p

[Devel] [PATCH RH9 01/11] device_cgroup: add device visibility virtualization in CT

2021-09-23 Thread Alexander Mikhalitsyn
From: Pavel Tikhomirov For the device cgroup (whitelist-based case), we hide all devices that are not in the whitelist from processes in that cgroup. - Cut devices in /proc/partitions and /proc/devices if the cgroup has no read or write permission for them. - Cut devices in sys_ustat() if the cgroup has no read permi
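The visibility rule described above (a device stays listed only if the whitelist grants read or write access) can be sketched as a simple filter. This is an illustration under stated assumptions, not the patch's code: struct dev_rule, the DEV_ACC_* bits and dev_visible() are hypothetical, and the real devices cgroup also handles wildcards, block/char types and the mknod bit, which are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical access bits, modelled on the r/w flags of a devices
 * cgroup whitelist entry. */
#define DEV_ACC_READ  0x1u
#define DEV_ACC_WRITE 0x2u

/* Hypothetical whitelist entry: exact major:minor match only. */
struct dev_rule {
    unsigned int major, minor;
    unsigned int access;     /* DEV_ACC_* bits */
};

/* A device stays visible (e.g. in /proc/partitions) only if some
 * whitelist entry grants read or write access to it. */
bool dev_visible(const struct dev_rule *wl, size_t n,
                 unsigned int major, unsigned int minor)
{
    for (size_t i = 0; i < n; i++)
        if (wl[i].major == major && wl[i].minor == minor &&
            (wl[i].access & (DEV_ACC_READ | DEV_ACC_WRITE)))
            return true;
    return false;
}
```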

[Devel] [PATCH RH9 07/11] ve/net/vxlan: enable support in a container

2021-09-23 Thread Alexander Mikhalitsyn
From: Konstantin Khorenko vxlan is safe in a CT because: 1) The UDP multicast socket that connects to the outer world sits in the creation net namespace, and this socket can only get packets forwarded/routed in the creation ns. 2) The vxlan device is owned by a second netns (could be the same as the first) as any other network device

[Devel] [PATCH RH9 06/11] ve/net: ip_vti: skip per net init in ve

2021-09-23 Thread Alexander Mikhalitsyn
From: Vladimir Davydov ip_vti devices lack NETIF_F_VIRTUAL, so they can't be created inside a container. The problem is that a device of this kind is created on net ns init if the module is loaded; as a result, container start fails with EPERM. We could allow ip_vti inside a container (as well as other ne

[Devel] [PATCH RH9 09/11] ve/kobj: Send events per VE instead of all net-namespaces broadcasting

2021-09-23 Thread Alexander Mikhalitsyn
From: Stanislav Kinsburskiy Currently uevents are broadcast to all net namespaces present in the system, which leads to a problem when C/R'ing systemd-based containers (the netlink socket sees data from the node and we can't dump until the data is read). So let's send broadcast events to ne

[Devel] [PATCH RH9 03/11] ve/net/cred: add ve_capable to check capabilities relative to the current VE (v2)

2021-09-23 Thread Alexander Mikhalitsyn
From: Konstantin Khorenko We want to allow a few operations in VE. Currently we use nsown_capable, but it's wrong, because in this case we allow these operations in any user namespace. v2: take ve0->cred if the current ve isn't running https://jira.sw.ru/browse/PSBM-39077 Signed-off-by: Andrew

[Devel] [PATCH RH9 11/11] ve/quota: allow to manage quota in top CT user ns

2021-09-23 Thread Alexander Mikhalitsyn
From: Konstantin Khorenko CT owner (and Virtuozzo tools like vzctl, prlctl) should be able to manage 2nd level quota - quota inside the CT => allow management in top CT user namespace. https://jira.sw.ru/browse/PSBM-40281 Signed-off-by: Konstantin Khorenko https://jira.sw.ru/browse/PSBM-1339

[Devel] [PATCH RH9 05/11] ve/net/ip_gre: containerize per-net devices

2021-09-23 Thread Alexander Mikhalitsyn
From: Kirill Tkhai Port patch diff-ve-net-ip_gre-containerize-per-net-devices from 2.6.32: This patch adds IP GRE devices support in a container. Done in the scope of https://jira.sw.ru/browse/PSBM-24331 Signed-off-by: Kirill Tkhai Signed-off-by: Stanislav Kinsbursky https://jira.sw.ru/brow

[Devel] [PATCH RH9 08/11] ve/uevent: Use own uevent_seqnum for every VE

2021-09-23 Thread Alexander Mikhalitsyn
From: Konstantin Khorenko https://jira.sw.ru/browse/PSBM-17903 Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-133986 (cherry picked from commit 739915b241c792831b89f87bb6260da8a6a515e7) Signed-off-by: Alexander Mikhalitsyn --- include/linux/ve.h | 6 ++ kernel/ksysfs.c

[Devel] [PATCH RH9 10/11] ve/net: introduce vz_security_*_check checks

2021-09-23 Thread Alexander Mikhalitsyn
From: Stanislav Kinsburskiy Signed-off-by: Konstantin Khlebnikov +++ ve/netlink: allow messages with family PF_BRIDGE type RTM_xxxNEIGH in CT While reproducing the problem mentioned in patch 1 I found that we need it to be able to configure vxlan fdb (Forwarding Database entry). https://jira.

[Devel] [PATCH RHEL7 COMMIT] ploop: Fix partition refcounter leak

2021-09-23 Thread Vasily Averin
The commit is pushed to "branch-rh7-3.10.0-1160.41.1.vz7.183.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-1160.41.1.vz7.183.4 --> commit 2b8d9cf32c04e2bf3eab299234b76773f424db9e Author: Konstantin Khorenko Date: Thu Sep 23 18:15:28 2021 +0300 p

[Devel] [PATCH RHEL8 COMMIT] ms/memcg: prohibit unconditional exceeding the limit of dying tasks

2021-09-23 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-305.3.1.vz8.7.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh8-4.18.0-305.3.1.vz8.7.13 --> commit ded8b4372053eb4abb997daf86dc3ac39f31465d Author: Vasily Averin Date: Thu Sep 23 19:26:48 2021 +0300 ms/memcg: proh

[Devel] [PATCH RHEL8 COMMIT] dm: Add dm-tracking target

2021-09-23 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-305.3.1.vz8.7.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh8-4.18.0-305.3.1.vz8.7.13 --> commit 4b7647f44f3097a068b7693dc2d3b06b7943c644 Author: Kirill Tkhai Date: Thu Sep 23 19:31:14 2021 +0300 dm: Add dm-trac

[Devel] [PATCH RHEL8 COMMIT] configs: Build dm-tracking module

2021-09-23 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-305.3.1.vz8.7.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh8-4.18.0-305.3.1.vz8.7.13 --> commit b1e57da042a504286f0930db48b7441d39b63dc7 Author: Konstantin Khorenko Date: Thu Sep 23 19:47:42 2021 +0300 configs:

[Devel] [PATCH RHEL8 COMMIT] push_backup: Remove suspended check

2021-09-23 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-305.3.1.vz8.7.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh8-4.18.0-305.3.1.vz8.7.13 --> commit 20578202eb6dc164d8118cc25c3703331250baa2 Author: Kirill Tkhai Date: Thu Sep 23 19:54:39 2021 +0300 push_backup: Re

[Devel] [PATCH RH9 03/33] ve/mm, oom: print information about ve of killed task

2021-09-23 Thread Andrey Zhadchenko
From: Andrey Ryabinin dmesg |grep Killed Before: Killed process 14892 (trinity-c271) total-vm:97920kB, anon-rss:2508kB, file-rss:1060kB After: Killed process 14892 (trinity-c271) in ve 4 total-vm:97920kB, anon-rss:2508kB, file-rss:1060kB https://jira.sw.ru/browse/PSBM-40610 Si

[Devel] [PATCH RH9 05/33] exit: clear TIF_MEMDIE after exit_task_work

2021-09-23 Thread Andrey Zhadchenko
From: Vladimir Davydov An mm_struct may be pinned by a file. An example is vhost-net device created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner -> vhost_dev_set_owner). If such process gets OOM-killed, the reference to its mm_struct will only be released from exit_task_work -> f

[Devel] [PATCH RH9 04/33] oom: do not dump all tasks info on each oom kill

2021-09-23 Thread Andrey Zhadchenko
From: Vladimir Davydov It will only spam dmesg on oom inside a container, so disable it by default as it used to be in PCS6. Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai (cherry picked from vz8 commit 34a56ac2c3bd06751e83f66d9ac2e9f13cc2515a) Signed-off-by: Andrey Zhadchenko ---

[Devel] [PATCH RH9 01/33] rh/lib/cpumask: Make CPUMASK_OFFSTACK usable without debug dependency

2021-09-23 Thread Andrey Zhadchenko
From: Josh Boyer When CPUMASK_OFFSTACK was added in 2008, it was dependent upon DEBUG_PER_CPU_MAPS being enabled, or an architecture could select it. The debug dependency adds additional overhead that isn't required for operation of the feature, and we need CPUMASK_OFFSTACK to increase the NR_CPU

[Devel] [PATCH RH9 26/33] ve/uts_ns: Implement cgroup interface to configure ve's os_release

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai It's similar to the VZCTL_VE_CONFIGURE ioctl in PCS6. Note: max_write_len is __NEW_UTS_LEN + 1, because I want to allow echo ... > ve.os_release, which adds a trailing '\n' to the string (see man echo for details). The extra symbol will be cut in ve_os_release_write(). https://ji
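Why max_write_len is __NEW_UTS_LEN + 1 can be shown with a small user-space sketch: the buffer must accept one extra byte for the '\n' that echo appends, which is then dropped before the length check. The function copy_os_release() is hypothetical and only mimics the trimming step attributed to ve_os_release_write(); it does no copy_from_user() or locking. NEW_UTS_LEN mirrors __NEW_UTS_LEN, which is 64 in <linux/utsname.h>.

```c
#include <string.h>

#define NEW_UTS_LEN 64   /* same value as __NEW_UTS_LEN in <linux/utsname.h> */

/* Copy a release string of at most NEW_UTS_LEN + 1 bytes into dst,
 * dropping one trailing '\n' such as the one echo appends.
 * Returns 0 on success, -1 if the string is still too long. */
int copy_os_release(char dst[NEW_UTS_LEN + 1], const char *src, size_t len)
{
    if (len > NEW_UTS_LEN + 1)
        return -1;                 /* longer than max_write_len */
    if (len > 0 && src[len - 1] == '\n')
        len--;                     /* cut the extra symbol */
    if (len > NEW_UTS_LEN)
        return -1;                 /* 65 chars without a newline */
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}
```

So `echo 5.14.0 > ve.os_release` writes 7 bytes ("5.14.0\n") and stores the 6-character release, while a full 64-character release plus its newline still fits in 65 bytes.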

[Devel] [PATCH RH9 10/33] kernel/freezer: don't freeze stopped & about to be ptraced task

2021-09-23 Thread Andrey Zhadchenko
From: Andrey Ryabinin So, the ptrace() hangs if we try to attach to stopped task from freezing cgroup: Tracee: Tracer: static bool do_signal_stop(int signr) __set_current_state(TASK_STOPPED); freezable_schedule(); freezer_do_

[Devel] [PATCH RH9 17/33] vzstat: Update sched lat in vzmon

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai FIXME: This patch must be rewritten via sched_entity::statistics::wait_max. Signed-off-by: Vladimir Davydov Signed-off-by: Kirill Tkhai (cherry picked from vz8 commit 85d40e62bfe81cbaba338b7ecb5fc1377a3f2d7b) Signed-off-by: Andrey Zhadchenko --- kernel/ve/vzstat.c | 65 ++

[Devel] [PATCH RH9 11/33] vzstat: Add base kstat structures and variables

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai Signed-off-by: Vladimir Davydov Signed-off-by: Kirill Tkhai +++ ve/vzstat.h: move some kstat definitions into new header Move some definitions into kstat.h, so we could use later in other headers (sched.h) https://jira.sw.ru/browse/PSBM-81395 Signed-off-by: Andrey Ryabinin

[Devel] [PATCH RH9 28/33] ve: Allow taskstats via netlink in netns

2021-09-23 Thread Andrey Zhadchenko
From: Vladimir Davydov In fact this patch was in 084.9 already, but was committed in kernel-ve, so this is a move from kernel-ve to kernel-patches. From: Konstantin Khlebnikov Date: Thu, 12 Sep 2013 15:25:31 +0400 Subject: [vzlin-dev] [PATCH rh6] ve: allow taskstats netlink in netns This patch

[Devel] [PATCH RH9 12/33] mm: Export swap_cache_info struct and variable

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai This will be used by vzstat. Signed-off-by: Kirill Tkhai (cherry picked from vz8 commit 7f795f34eebf25d87c638a27ba0cb68d307f72de) Signed-off-by: Andrey Zhadchenko --- include/linux/swap.h | 9 + mm/swap_state.c | 8 ++-- 2 files changed, 11 insertions(+),

[Devel] [PATCH RH9 31/33] ve/itimer: add ve_name to warning for a NULL new_value

2021-09-23 Thread Andrey Zhadchenko
From: Dmitry Safonov The host admin may be confused by a warning in dmesg with only "comm", which may be anything a user in a container chooses. Add the ve name to this warning. https://jira.sw.ru/browse/PSBM-49818 Signed-off-by: Dmitry Safonov Signed-off-by: Stanislav Kinsburskiy (cherry picked f

[Devel] [PATCH RH9 00/33] port part 5

2021-09-23 Thread Andrey Zhadchenko
046f7c4bcc5a ("ve/mm: print OOM info to VE log") will be sent later as it depends on part 3 Andrew Vagin (1): ve/fs: add ve_capable to check capabilities relative to the current VE Andrey Ryabinin (3): ve/mm,oom: print information about ve of killed task proc,memcg: use memcg limits for sh

[Devel] [PATCH RH9 22/33] ve/fanotify: Use ve-capable instead of plain capable test

2021-09-23 Thread Andrey Zhadchenko
From: Cyrill Gorcunov To create fanotify objects, one has to be the sysadmin of a container. The main potential problem is an unlimited number of marks and an unbounded queue, but since it uses the kmem cgroup to obtain objects, this should be controllable via memory cgroup settings. https://jira.sw.ru/browse/PSBM-41409

[Devel] [PATCH RH9 16/33] kernel: Export tasklist_lock

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai Needed for our modules (vzstat). FIXME: maybe reduce to EXPORT_SYMBOL_GPL(). Signed-off-by: Kirill Tkhai (cherry picked from vz8 commit 88e9f6757a92d2687aae51cc63ab8d6c4e1d60cf) Signed-off-by: Andrey Zhadchenko --- kernel/fork.c | 1 + 1 file changed, 1 insertion(+) diff

[Devel] [PATCH RH9 33/33] ve/proc/net: virtualize all the network proc entries

2021-09-23 Thread Andrey Zhadchenko
From: Stanislav Kinsburskiy Directories should be created via proc_net_mkdir(), files via proc_net_create() or proc_net_create_data() to become visible in a container. Signed-off-by: Konstantin Khlebnikov Signed-off-by: Stanislav Kinsburskiy khorenko@: TODO: helpers are to be renamed. Rebase

[Devel] [PATCH RH9 21/33] ve/fs/locks: Make CAP_LEASE work in containers

2021-09-23 Thread Andrey Zhadchenko
From: Evgenii Shatokhin Allowing the privileged processes in the containers to set leases on arbitrary files seems to do no harm. Let us make CAP_LEASE work there. https://jira.sw.ru/browse/PSBM-46199 Signed-off-by: Evgenii Shatokhin Acked-by: Cyrill Gorcunov (cherry picked from vz8 commit

[Devel] [PATCH RH9 23/33] ve/fs/namei: fix capabilities check in sys_renameat2 () to support Containers

2021-09-23 Thread Andrey Zhadchenko
From: Stanislav Kinsburskiy Signed-off-by: Stanislav Kinsburskiy Rebased to vz9: - ve_capable is moved to vfs_mknod because of ms commit a3c751a50fe6 ("vfs: allow unprivileged whiteout creation") (cherry picked from commit vz8 ea5765973b0087b555d608622b4ad6a676395b23) Signed-off-by: Andrey

[Devel] [PATCH RH9 20/33] VE/FS: containerize filesystems access

2021-09-23 Thread Andrey Zhadchenko
From: Stanislav Kinsburskiy Disable non-virtualized file systems in containers. This patch consists of two logical parts: 1) Filter out non-containerized filesystems from the "/proc/filesystems" output. 2) Forbid access to the fs structure if the current VE is not super and the filesystem is not containerized.

[Devel] [PATCH RH9 27/33] ve/netlink: allow IPVS netlink messages to CT init userns

2021-09-23 Thread Andrey Zhadchenko
From: Pavel Tikhomirov Docker swarm tries to set up IPVS via netlink, so let it do so. The same can be done through setsockopt, as each IPVS_CMD_* has its IP_VS_SO_SET_* analogue. https://jira.sw.ru/browse/PSBM-63883 Signed-off-by: Pavel Tikhomirov Reviewed-by: Andrew Vagin (cherry picked from vz8 c

[Devel] [PATCH RH9 14/33] vzstat: Add vzstat module and kstat interfaces

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai Signed-off-by: Vladimir Davydov Signed-off-by: Kirill Tkhai +++ vzstat: account cpu total time properly in mm performance stats /proc/vz/mmperf occasionally accounts/shows wall total time in both "Wall_tot_time" and "CPU_tot_time" columns; fix this. Fixes: c0a20dd32be6 ("

[Devel] [PATCH RH9 29/33] ve/taskstats: allow delivery of task attributes in CT context

2021-09-23 Thread Andrey Zhadchenko
From: Stanislav Kinsburskiy This is needed to make iotop work in container: https://jira.sw.ru/browse/PSBM-56171 Signed-off-by: Stanislav Kinsburskiy (cherry picked from vz8 commit 288ad5c98b3bb97ba30c427989902f4173748441) Signed-off-by: Andrey Zhadchenko --- kernel/taskstats.c | 2 +- 1 fi

[Devel] [PATCH RH9 30/33] ve/lockdep: Taint kernel on circular locking complains

2021-09-23 Thread Andrey Zhadchenko
From: Vladimir Davydov Author: Konstantin Khorenko Email: khore...@parallels.com Subject: lockdep: taint kernel on circular locking complains Date: Mon, 23 Sep 2013 20:04:29 +0400 Currently our QA infrastructure cannot efficiently grep kernel logs and analyze them, thus lockdep complaints are lost

[Devel] [PATCH RH9 18/33] fs/ve: add new FS_VE_MOUNT flag to allow mount in container init userns

2021-09-23 Thread Andrey Zhadchenko
From: Stanislav Kinsburskiy This patch is a part of vz7 commit 4e8e69eb16b1 ("fs/ve: add new FS_VE_MOUNT flag to allow mount in container init userns") Some filesystems are allowed to be mounted only in init userns in mainstream/rh kernel. And some of those we still would like to mount in Contai

[Devel] [PATCH RH9 09/33] oom: make berserker more aggressive

2021-09-23 Thread Andrey Zhadchenko
From: Vladimir Davydov In the berserker mode we kill a bunch of tasks that are as bad as the selected victim. We assume two tasks to be equally bad if they consume the same permille of memory. With such a strict check, it might turn out that oom berserker won't kill any tasks in case a fork bomb

[Devel] [PATCH RH9 32/33] proc/net: proc_net_*() helpers introduced

2021-09-23 Thread Andrey Zhadchenko
From: Stanislav Kinsburskiy Will be used to make per-net sysfs files visible in CT. Signed-off-by: Stanislav Kinsburskiy Rebase to RHEL8 beta kernel notes: more helpers added: proc_net_{seq*,single*,net*} TODO: rename helpers - proc_net_create_net() does not look good. Signed-off-by: Konsta

[Devel] [PATCH RH9 13/33] mm: Export first_online_pgdat() and next_online_pgdat()

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai This is needed for vzstat. Signed-off-by: Kirill Tkhai (cherry picked from vz8 commit a8d3db1b7cd6f785fcc23255e5243fdfd905fcb3) Signed-off-by: Andrey Zhadchenko --- mm/mmzone.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/mmzone.c b/mm/mmzone.c index eb89d6e..9c4

[Devel] [PATCH RH9 08/33] oom: resurrect berserker mode

2021-09-23 Thread Andrey Zhadchenko
From: Vladimir Davydov Feature: oom: berserker mode The logic behind the OOM berserker is the same as in PCS6: if processes are killed by oom killer too often (< sysctl vm.oom_relaxation, 1 sec by default), we increase "rage" (min -10, max 20) and kill 1 << "rage" youngest worst processes if "ra
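The "rage" bookkeeping described above can be sketched as follows. This is a simplified illustration, not the patch's code: the function names are hypothetical, the snippet is truncated before the condition under which the extra kills happen and does not show how rage decays, so this sketch only raises rage when kills come faster than vm.oom_relaxation, clamps it to [-10, 20], and computes 1 << rage only for a non-negative rage (returning 0 otherwise is an assumption, also because shifting by a negative amount is undefined in C).

```c
#define RAGE_MIN (-10)
#define RAGE_MAX 20

/* Raise rage by one if the previous OOM kill happened less than
 * relaxation_ms ago (vm.oom_relaxation, 1 sec by default), keeping it
 * within [RAGE_MIN, RAGE_MAX]. The decay rule is not shown in the
 * snippet, so rage is otherwise left unchanged here. */
int berserker_update_rage(int rage, long since_last_kill_ms, long relaxation_ms)
{
    if (since_last_kill_ms < relaxation_ms && rage < RAGE_MAX)
        rage++;
    if (rage < RAGE_MIN)
        rage = RAGE_MIN;
    return rage;
}

/* Number of youngest worst tasks to kill: 1 << rage.
 * Assumption: nothing extra is killed while rage is negative. */
long berserker_kill_count(int rage)
{
    return rage >= 0 ? 1L << rage : 0;
}
```

With the maximum rage of 20, one berserker round would target up to 1 << 20 tasks, which is why the clamp matters.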

[Devel] [PATCH RH9 25/33] ve/block: add ve_capable to check capabilities relative to the current VE

2021-09-23 Thread Andrey Zhadchenko
From: Stanislav Kinsburskiy We want to allow a few operations in VE. Currently we use nsown_capable, but it's wrong, because in this case we allow these operations in any user namespace. https://jira.sw.ru/browse/PSBM-39077 Signed-off-by: Andrew Vagin Signed-off-by: Stanislav Kinsburskiy (ch

[Devel] [PATCH RH9 15/33] vzstat,sched: Track sched_lat_ve

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai FIXME: This patch must be rewritten via sched_entity::statistics::wait_max. Signed-off-by: Vladimir Davydov Signed-off-by: Kirill Tkhai (cherry picked from vz8 commit c0fb61430afceb6564a1b0792d51835b70eab9e1) Signed-off-by: Andrey Zhadchenko --- include/linux/ve.h | 3 +

[Devel] [PATCH RH9 02/33] memcg: do not allow to disable oom from inside a container

2021-09-23 Thread Andrey Zhadchenko
From: Vladimir Davydov It is possible to disable oom killer inside a memory cgroup by writing 1 to memory.oom_control. If a process inside such a cgroup hits the memory limit and is unable to reclaim anything, it will wait until more memory becomes available. This operation shouldn't be allowed

[Devel] [PATCH RH9 24/33] ve/fs: add ve_capable to check capabilities relative to the current VE

2021-09-23 Thread Andrey Zhadchenko
From: Andrew Vagin We want to allow a few operations in VE. Currently we use nsown_capable, but it's wrong, because in this case we allow these operations in any user namespace. https://jira.sw.ru/browse/PSBM-39077 Signed-off-by: Andrew Vagin Signed-off-by: Stanislav Kinsburskiy khorenko@: r

[Devel] [PATCH RH9 07/33] proc, memcg: use memcg limits for showing oom_score inside CT

2021-09-23 Thread Andrey Zhadchenko
From: Andrey Ryabinin Use the task's memcg limits to show /proc/<pid>/oom_score. Note: in vz7 we had different behavior. It showed 'oom_score' based on the 've->memcg' limits of the process reading oom_score. Now we look at the memcg of the process and don't care about the current one. It seems more correct behavio
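The effect of the denominator can be illustrated with the scaling that, to our understanding, mainline fs/proc/base.c has applied in proc_oom_score() since about v5.9: the badness is scaled against totalpages and mapped into a 0..1333 range (the OOM_SCORE_ADJ_MIN special case is omitted). Mainline uses RAM + swap as totalpages; the patch is described as using the task's memcg limits instead. The helper name below is hypothetical; treat this as an approximation, not the patch's code.

```c
/* Scale an OOM badness value (in pages) the way /proc/<pid>/oom_score
 * reports it, given the number of pages the task is allowed to use. */
long oom_score_from_badness(long badness, long totalpages)
{
    return (1000 + badness * 1000 / totalpages) * 2 / 3;
}
```

For the same badness of 500 pages, a 1000-page memcg limit gives 1000, while a 4000-page denominator gives 750, so a task close to its cgroup limit reports a visibly higher score.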

[Devel] [PATCH RH9 19/33] fs: Mask appropriate filesystems FS_VIRTUALIZED

2021-09-23 Thread Andrey Zhadchenko
From: Kirill Tkhai Extracted from "Initial patch". Signed-off-by: Kirill Tkhai Signed-off-by: Stanislav Kinsburskiy +++ ve/fs/autofs: Allow autofs to be used inside a container It turned out that autofs is used at least for NFS/CIFS and binfmt_misc. Let's use new FS_VE_MOUNT flag to only al

[Devel] [PATCH RH9 06/33] memcg: add oom_guarantee

2021-09-23 Thread Andrey Zhadchenko
From: Vladimir Davydov Feature: mm: OOM guarantee This patch description: OOM guarantee works exactly like low limit, but for OOM, i.e. tasks inside cgroups above the limit are killed first. Read/write via memory.oom_guarantee. Signed-off-by: Vladimir Davydov Signed-off-by: Andrey Ryabinin
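The "killed first" rule can be sketched as a two-pass victim selection: cgroups whose usage exceeds memory.oom_guarantee are considered first, and protected cgroups only become candidates when no unprotected one exists, mirroring how memory.low protection is described. The struct and functions are hypothetical, and the two-pass structure is a simplification for illustration rather than the patch's actual selection logic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cgroup summary used for victim selection. */
struct memcg_info {
    unsigned long usage;          /* current usage, pages */
    unsigned long oom_guarantee;  /* memory.oom_guarantee, pages */
    long badness;                 /* badness of the worst task inside */
};

/* A cgroup above its OOM guarantee is not protected. */
static bool above_guarantee(const struct memcg_info *cg)
{
    return cg->usage > cg->oom_guarantee;
}

/* Pass 0 considers only cgroups above their guarantee; pass 1 falls
 * back to all cgroups. Returns the index of the worst candidate, or
 * -1 if n == 0. */
long pick_victim_cgroup(const struct memcg_info *cgs, size_t n)
{
    for (int pass = 0; pass < 2; pass++) {
        long best = -1;

        for (size_t i = 0; i < n; i++) {
            if (pass == 0 && !above_guarantee(&cgs[i]))
                continue;
            if (best < 0 || cgs[i].badness > cgs[best].badness)
                best = (long)i;
        }
        if (best >= 0)
            return best;
    }
    return -1;
}
```

Here a cgroup within its guarantee is spared even if its worst task has a higher badness than the worst task of a cgroup that has exceeded its guarantee.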