In criu we do:

  +-> restore_one_alive_task
    +-> set_user_ns #1
  +-> restore_one_alive_task
    +-> sigreturn_restore #2
      +-> arch_export_restore_task
        +-> __export_restore_task
          +-> sys_prctl(PR_SET_MM, PR_SET_MM_MAP,...)

So we call PR_SET_MM after we've switched to an unprivileged userns, but
PR_SET_MM_MAP is already available in an unprivileged context. In the
fallback case where PR_SET_MM_MAP is not available there would be a
problem, but our kernel has it, so criu should just work.

In spfs we do PR_SET_MM + PR_SET_MM_EXE_FILE from the parasite (which can
be in an unprivileged userns). That one is not available to unprivileged
callers in mainstream.

Here are the descriptions of the patches which allowed PR_SET_MM_EXE_FILE
everywhere and all other PR_SET_MM flags in ve:

+++
ve/prctl_set_mm: allow to change mm content in ve

This is required to be able to change /proc/pid/exe of a process running
on NFS. The SPFS manager, which makes this change, is a child of the criu
process, which is started in the container from the very beginning.

https://jira.sw.ru/browse/PSBM-26967

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>

(cherry picked from vz8 commit 850d71b3cebc0796b87d45659c832d44234328d6)
Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
+++
prctl: reduce requirements to exe link change

Do not require CAP_SYS_RESOURCE anymore to change the exe link.
This is needed to allow the spfs manager to change it in an unprivileged
process.
In the case of CRIU this restriction wasn't a problem, since CRIU is a
privileged process and drops capabilities _after_ the exe link change.
But the spfs manager is then not able to do the same thing for an
unprivileged process.
We are not going to push NFS to upstream anymore, and thus can relax the
requirements in our kernel.

Note: this limitation is somewhat strange, because the exe link can be
changed upon the execve system call.
https://jira.sw.ru/browse/PSBM-50867

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
Acked-by: Konstantin Khorenko <khore...@virtuozzo.com>

khorenko@: this allows online migration of unprivileged processes whose
binaries lie on an NFS volume.

(cherry picked from commit 4737d188f94f05eb58e770c040f64f1fa49efbce)
Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>

Let's make it more restrictive and only allow PR_SET_MM_EXE_FILE, which
seems to be the only thing we actually need here.

https://jira.sw.ru/browse/PSBM-133993

Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
---
 kernel/sys.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index b0caeed760bd..fedc4d14b1af 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2159,12 +2159,12 @@ static int prctl_set_mm(int opt, unsigned long addr,
 		return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
 #endif
 
-	if (!capable(CAP_SYS_RESOURCE))
-		return -EPERM;
-
 	if (opt == PR_SET_MM_EXE_FILE)
 		return prctl_set_mm_exe_file(mm, (unsigned int)addr);
 
+	if (!capable(CAP_SYS_RESOURCE))
+		return -EPERM;
+
 	if (opt == PR_SET_MM_AUXV)
 		return prctl_set_auxv(mm, addr, arg4);
 
-- 
2.31.1