We get a crash on container start:

[ 120.818119] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:201
[ 120.819805] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 3299, name: vzctl
[ 120.821504] INFO: lockdep is turned off.
[ 120.823206] irq event stamp: 6130
[ 120.824901] hardirqs last enabled at (6129): [<ffffffff93e00cc2>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 120.826710] hardirqs last disabled at (6130): [<ffffffff93d1d9f9>] _raw_spin_lock_irq+0x59/0x80
[ 120.828525] softirqs last enabled at (6128): [<ffffffff940005e5>] __do_softirq+0x5e5/0x939
[ 120.830350] softirqs last disabled at (6119): [<ffffffff91df9d79>] irq_exit_rcu+0x1e9/0x240
[ 120.832178] CPU: 1 PID: 3299 Comm: vzctl ve: 200 Tainted: G        W --------- ---  5.14.0-4.vz9.10.17+debug #1 10.17
[ 120.834095] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[ 120.836022] Call Trace:
[ 120.837930]  dump_stack_lvl+0x56/0x7b
...
[ 120.839851]  ___might_sleep.cold.155+0x140/0x16e
[ 120.841792]  __kmalloc_track_caller+0x289/0x2f0
[ 120.843730]  kstrdup+0x2d/0x50
[ 120.845665]  __kernfs_new_node+0xaa/0x780
...
[ 120.861205]  kernfs_new_node+0x77/0x140
[ 120.863134]  __kernfs_create_file+0x37/0x2e0
[ 120.865064]  cgroup_add_file+0x1a4/0x4e0
[ 120.872775]  cgroup_mark_ve_roots+0x4e6/0x6a0
[ 120.874697]  ve_state_write+0x8b8/0xc50
[ 120.876630]  cgroup_file_write+0x40c/0x930
...
[ 120.884444]  kernfs_fop_write_iter+0x2d3/0x490
[ 120.886384]  new_sync_write+0x3b2/0x620
[ 120.893982]  vfs_write+0x4b5/0x8f0
[ 120.895769]  ksys_write+0xf1/0x1c0
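The splat reduces to a simple pattern: kstrdup() does a GFP_KERNEL
allocation, which may sleep, and it is reached from cgroup_add_file()
while css_set_lock is held with IRQs disabled. A minimal sketch that
reproduces the same class of splat on a CONFIG_DEBUG_ATOMIC_SLEEP
kernel (a hypothetical demo module, not the VZ code):

/*
 * Hypothetical demo module, illustration only: a sleeping allocation
 * in atomic context, the same class of bug as in the trace above.
 */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);

static int __init demo_init(void)
{
	char *p;

	spin_lock_irq(&demo_lock);	/* atomic context, IRQs disabled */
	p = kmalloc(16, GFP_KERNEL);	/* may sleep -> trips ___might_sleep() */
	spin_unlock_irq(&demo_lock);

	kfree(p);
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");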
The problem is that cgroup_add_file() performs GFP_KERNEL
allocations, and cgroup_mark_ve_roots() calls it under the
css_set_lock spinlock.

After discussing the problem with Pavel, we decided to release the
css_set_lock spinlock before the traversal. This looks safe because,
as we can see, all the paths that remove elements from ->cgrp_links
are protected by the cset refcount, which is at least one here
because we hold the cgroup namespace attached to our VE.

Fixes: ee84e05864f9a ("ve/cgroup: Set release_agent_path for root cgroups separately")

https://jira.sw.ru/browse/PSBM-135190

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com>

---
v2: commit message/code comment fixes
---
 kernel/cgroup/cgroup.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 396c0dc98b64..116a664a0f5d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2083,6 +2083,29 @@ int cgroup_mark_ve_roots(struct ve_struct *ve)
 		return -EINVAL;
 	}
 
+	/*
+	 * We want to traverse the cset->cgrp_links list and
+	 * call the cgroup_add_file() function, which performs
+	 * memory allocations with the GFP_KERNEL flag.
+	 * We can't continue to hold css_set_lock, but it's
+	 * safe to hold cgroup_mutex.
+	 *
+	 * It's safe to hold only cgroup_mutex and traverse
+	 * the cset list because there are only two scenarios
+	 * in which the ->cgrp_links list is modified:
+	 *
+	 * 1. cgroup_destroy_root
+	 *    this function is called only during css destruction
+	 *
+	 * 2. put_css_set_locked
+	 *    here we are protected by the !refcount_dec_and_test(&cset->refcount)
+	 *    check, which prevents any list modification while someone
+	 *    still holds a reference to the cset.
+	 *    See copy_cgroup_ns(), which takes references via get_css_set().
+	 *
+	 */
+	spin_unlock_irq(&css_set_lock);
+
 	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
 		cgrp = link->cgrp;
 
@@ -2108,7 +2131,6 @@ int cgroup_mark_ve_roots(struct ve_struct *ve)
 	}
 
 	link_ve_root_cpu_cgroup(cset->subsys[cpu_cgrp_id]);
-	spin_unlock_irq(&css_set_lock);
 	mutex_unlock(&cgroup_mutex);
 
 	synchronize_rcu();
-- 
2.31.1
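For reference, the locking pattern the patch switches to, shown in
isolation (an illustrative sketch with hypothetical demo_* names,
not the actual cgroup code): take the sleeping outer mutex, do the
work that genuinely needs the spinlock, then drop the spinlock
before the loop that allocates, relying on the outer mutex and a
held reference to keep the list stable.

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/string.h>

static DEFINE_MUTEX(demo_mutex);	/* outer sleeping lock, stays held */
static DEFINE_SPINLOCK(demo_lock);	/* inner spinlock, must be dropped */
static LIST_HEAD(demo_list);

struct demo_item {
	struct list_head node;
	const char *name;
};

static int demo_mark_all(void)
{
	struct demo_item *it;
	int ret = 0;

	mutex_lock(&demo_mutex);
	spin_lock_irq(&demo_lock);

	/* ... checks that genuinely need the spinlock go here ... */

	/*
	 * Drop the spinlock before the loop: the loop body allocates
	 * with GFP_KERNEL and may sleep. demo_mutex (still held) is
	 * what keeps demo_list stable from here on, the same role
	 * cgroup_mutex plus the cset refcount play in the patch.
	 */
	spin_unlock_irq(&demo_lock);

	list_for_each_entry(it, &demo_list, node) {
		char *copy = kstrdup(it->name, GFP_KERNEL);	/* may sleep: fine now */

		if (!copy) {
			ret = -ENOMEM;
			break;
		}
		kfree(copy);
	}

	mutex_unlock(&demo_mutex);
	return ret;
}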