------- Comment From s...@de.ibm.com 2020-07-23 11:06 EDT-------
Hi,

I was able to reproduce the "make: echo: Operation not permitted" on my Ubuntu 20.04 s390x machine. I've built and installed the mentioned make-dfsg_4.3-4ubuntu1 package without the "--disable-posix-spawn" configure flag. I've built flatpak-builder_1.0.11-1, which executes the test that triggers the "Operation not permitted".
Then I've adjusted the tests so that I can also run them without building the package itself. This test runs flatpak-builder, which prepares some stuff (e.g. a root directory with all needed files / binaries / libraries). flatpak-builder then creates a container with bwrap and calls a configure script, which generates a Makefile. In a second invocation, make is invoked.

I've adjusted the configure script, which now executes a small program of my own. This program first waits some time, which I use to determine its PID. Then I can attach either strace or gdb. After the timeout, the program just execve's to make. Thus in the end I have a process chain like:

  flatpak-builder--bwrap---bwrap---configure---make

The strace output shows that the clone syscall is failing with EPERM:

  4269 17:08:47.914142 stat("/usr/bin/echo", {st_mode=S_IFREG|0755, st_size=39136, ...}) = 0 <0.000003>
  4270 17:08:47.914155 geteuid() = 1001 <0.000001>
  4271 17:08:47.914167 getegid() = 1001 <0.000002>
  4272 17:08:47.914175 getuid() = 1001 <0.000001>
  4273 17:08:47.914182 getgid() = 1001 <0.000001>
  4274 17:08:47.914189 access("/usr/bin/echo", X_OK) = 0 <0.000005>
  4275 17:08:47.914203 mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x3ff9c86b000 <0.000002>
  4276 17:08:47.914214 rt_sigprocmask(SIG_BLOCK, ~[], [HUP INT QUIT TERM CHLD XCPU XFSZ], 8) = 0 <0.000001>
  4277 17:08:47.914224 clone(child_stack=0x3ff9c874000, flags=CLONE_VM|CLONE_VFORK|SIGCHLD) = -1 EPERM (Operation not permitted) <0.000001>
  4278 17:08:47.914235 munmap(0x3ff9c86b000, 36864) = 0 <0.000004>
  4279 17:08:47.914245 rt_sigprocmask(SIG_SETMASK, [HUP INT QUIT TERM CHLD XCPU XFSZ], NULL, 8) = 0 <0.000001>

A gdb session showed that posix_spawn is called by make as follows (Info: make uses vfork() if configured with "--disable-posix-spawn"):

  jobs.c:child_execute_job (struct childbase *child, int good_stdin, char **argv)
    posix_spawnattr_t attr;
    posix_spawn_file_actions_t fa;
    short flags = 0;
    posix_spawnattr_init (&attr)
    posix_spawn_file_actions_init (&fa)
    flags |= POSIX_SPAWN_SETSIGMASK;   => 0x08
    flags |= POSIX_SPAWN_USEVFORK;     => 0x40
    fdin=0, fdout=1, fderr=2
    flags |= POSIX_SPAWN_RESETIDS;     => 0x01
    => flags = 0x49
    posix_spawnattr_setflags (&attr, flags)
    /* Start the program. */
    while ((r = posix_spawn (&pid, cmd, &fa, &attr, argv, child->environment)) == EINTR)
      ;

In glibc, posix_spawn is doing this:

  posix_spawn(...)
  -> __spawni(..., 0)
  -> __spawnix(..., __execve)
    void *stack = __mmap (NULL, stack_size, prot, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    /* Disable asynchronous cancellation. */
    __libc_signal_block_all (&args.oldmask);
    # define CLONE(__fn, __stack, __stacksize, __flags, __args) \
      __clone (__fn, __stack, __flags, __args)
    new_pid = CLONE (__spawni_child, STACK (stack, stack_size), stack_size, CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
    => __clone (__spawni_child, __stack, CLONE_VM | CLONE_VFORK | SIGCHLD, &args);

<glibc-src>/sysdeps/unix/sysv/linux/s390/s390-64/clone.S

  (gdb) i r r2 r3 r4 r5 r6
  r2 0x3ffb22f53c0 4396740989888
  r3 0x3ffb24f4000 4396743081984
  r4 0x4111 16657
  r5 0x3ffce97c9e0 4397217597920
  r6 0xffffffffffffffff 18446744073709551615

  >0x3ffb2306760 <clone>     stg %r6,48(%r15)
   0x3ffb2306766 <clone+6>   lgr %r0,%r5
   0x3ffb230676a <clone+10>  ltgr %r1,%r2
   0x3ffb230676e <clone+14>  je 0x3ffb23067a6 <clone+70>
   0x3ffb2306772 <clone+18>  ltgr %r2,%r3
   0x3ffb2306776 <clone+22>  je 0x3ffb23067a6 <clone+70>
   0x3ffb230677a <clone+26>  lgr %r3,%r4
   0x3ffb230677e <clone+30>  lgr %r4,%r6
   0x3ffb2306782 <clone+34>  lg %r5,168(%r15)
   0x3ffb2306788 <clone+40>  lg %r6,160(%r15)

  (gdb) i r r1 r2 r3 r4 r5 r6
  r1 0x3ffb22f53c0 4396740989888
  r2 0x3ffb24f4000 4396743081984
  r3 0x4111 16657
     # define CLONE_VM 0x00000100 /* Set if VM shared between processes. */
     # define CLONE_VFORK 0x00004000 /* Set if the parent wants the child to wake it up on mm_release. */
     <glibc-src>/sysdeps/unix/sysv/linux/bits/signum.h:41:#define SIGCHLD 17 => 0x11
  r4 0xffffffffffffffff 18446744073709551615
  r5 0x3ffce97c960 4397217597792
  r6 0x0 0

  /* sys_clone (void *child_stack, unsigned long flags, pid_t *parent_tid, pid_t *child_tid, void *tls); */
   0x3ffb230678e <clone+46>  svc 120

=> sys_clone returns EPERM instead of succeeding and jumping to __spawni_child().

At this time, the make process has these open files:

  find /proc/273963/fdinfo -type f -printf "\ncat %p\n" -exec cat {} \;

  cat /proc/273963/fdinfo/0
  pos: 0
  flags: 0100000
  mnt_id: 25

  cat /proc/273963/fdinfo/1
  pos: 0
  flags: 02001
  mnt_id: 14

  cat /proc/273963/fdinfo/2
  pos: 0
  flags: 02001
  mnt_id: 14

  ls -la /proc/273963/fd/*
  lr-x------ 1 stli stli 64 Jul 23 10:31 /proc/273963/fd/0 -> /dev/null
  l-wx------ 1 stli stli 64 Jul 23 10:35 /proc/273963/fd/1 -> 'pipe:[661251]'
  l-wx------ 1 stli stli 64 Jul 23 10:35 /proc/273963/fd/2 -> 'pipe:[661251]'

A workmate of mine gave me a hint that he had a similar issue with podman containers where a seccomp filter was applied. Thus I've used https://github.com/david942j/seccomp-tools with a private patch from my workmate which enables s390x support.
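Independent of seccomp-tools, a quick way to see whether a process runs under a seccomp filter at all is the "Seccomp:" field in /proc/<PID>/status (0 = disabled, 1 = strict, 2 = filter). The following is only a minimal sketch of parsing that field; the function name parse_seccomp_mode is hypothetical, and the caller is assumed to have already read the status file into a string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the seccomp mode found in the text of
 * /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter),
 * or -1 if no "Seccomp:" line is present. */
int parse_seccomp_mode(const char *status_text)
{
    assert(status_text != NULL);
    const char *key = "Seccomp:";
    const char *line = status_text;

    while (line != NULL && *line != '\0') {
        /* Match only at the start of a line, so "Seccomp_filters:" is skipped. */
        if (strncmp(line, key, strlen(key)) == 0)
            return (int)strtol(line + strlen(key), NULL, 10);
        line = strchr(line, '\n');
        if (line != NULL)
            line++; /* step past the newline to the next line */
    }
    return -1;
}
```

Feeding it the contents of /proc/<PID of make>/status would be expected to yield 2 if a filter is attached to that process.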
And indeed, there is a seccomp filter applied for the second bwrap process and its children:

   line  CODE  JT   JF      K
  =================================
   0000: 0x20 0x00 0x00 0x00000004  A = arch
   0001: 0x15 0x00 0x1f 0x80000016  if (A != ARCH_S390X) goto 0033
   0002: 0x20 0x00 0x00 0x00000000  A = sys_number
   0003: 0x15 0x1c 0x00 0x00000015  if (A == mount) goto 0032
   0004: 0x15 0x1b 0x00 0x00000033  if (A == acct) goto 0032
   0005: 0x15 0x1a 0x00 0x00000056  if (A == uselib) goto 0032
   0006: 0x15 0x19 0x00 0x00000067  if (A == syslog) goto 0032
   0007: 0x15 0x18 0x00 0x00000083  if (A == quotactl) goto 0032
   0008: 0x15 0x17 0x00 0x000000d9  if (A == pivot_root) goto 0032
   0009: 0x15 0x16 0x00 0x0000010c  if (A == mbind) goto 0032
   0010: 0x15 0x15 0x00 0x0000010d  if (A == get_mempolicy) goto 0032
   0011: 0x15 0x14 0x00 0x0000010e  if (A == set_mempolicy) goto 0032
   0012: 0x15 0x13 0x00 0x00000116  if (A == add_key) goto 0032
   0013: 0x15 0x12 0x00 0x00000117  if (A == request_key) goto 0032
   0014: 0x15 0x11 0x00 0x00000118  if (A == keyctl) goto 0032
   0015: 0x15 0x10 0x00 0x0000011f  if (A == migrate_pages) goto 0032
   0016: 0x15 0x0f 0x00 0x0000012f  if (A == unshare) goto 0032
   0017: 0x15 0x0e 0x00 0x00000136  if (A == move_pages) goto 0032
   0018: 0x15 0x00 0x05 0x00000036  if (A != ioctl) goto 0024
         # => for clone, we goto 0024
   0019: 0x20 0x00 0x00 0x00000018  A = cmd # ioctl(fd, cmd, arg)
   0020: 0x54 0x00 0x00 0x00000000  A &= 0x0
   0021: 0x15 0x00 0x09 0x00000000  if (A != 0) goto 0031
   0022: 0x20 0x00 0x00 0x0000001c  A = cmd >> 32 # ioctl(fd, cmd, arg)
   0023: 0x15 0x08 0x07 0x00005412  if (A == 0x5412) goto 0032 else goto 0031
   0024: 0x15 0x00 0x06 0x00000078  if (A != clone) goto 0031
         # => all other syscalls are allowed, but clone is handled here
   0025: 0x20 0x00 0x00 0x00000010  A = clone_flags # clone(clone_flags, newsp, parent_tidptr, child_tidptr, tls)
   0026: 0x54 0x00 0x00 0x00000000  A &= 0x0
   0027: 0x15 0x00 0x03 0x00000000  if (A != 0) goto 0031
         # => the previous check seems to be a nop
   0028: 0x20 0x00 0x00 0x00000014  A = clone_flags >> 32 # clone(clone_flags, newsp, parent_tidptr, child_tidptr, tls)
         # => The flags of clone are checked:
   0029: 0x54 0x00 0x00 0x10000000  A &= 0x10000000
         # define CLONE_NEWUSER 0x10000000 = 268435456 /* New user namespace. */
   0030: 0x15 0x01 0x00 0x10000000  if (A == 268435456) goto 0032
         # => ERRNO(1) which is EPERM
   0031: 0x06 0x00 0x00 0x7fff0000  return ALLOW
   0032: 0x06 0x00 0x00 0x00050001  return ERRNO(1)
   0033: 0x06 0x00 0x00 0x00000000  return KILL

Unfortunately the order of arguments for the clone syscall on s390x differs from x86_64!
=> The filter is checking the first argument, which on s390x is the stack pointer instead of the flags.

Note: The order of the arguments and their names are hardcoded in the seccomp-tools disassembler. The seccomp filter itself uses the argument index. The ">> 32" is also hardcoded seccomp-tools output, depending on whether the index of the argument is even or odd.

I saw that bwrap can apply a seccomp filter:

  bubblewrap.c:do_init():
    if (seccomp_prog != NULL &&
        prctl (PR_SET_SECCOMP, SECCOMP_MODE_FILTER, seccomp_prog) != 0)
      die_with_error ("prctl(PR_SET_SECCOMP)");

This is executed if you call bwrap with "--seccomp FD" (Load and use seccomp rules from FD).

I've also dumped /proc/PID/cmdline for the processes:

  flatpak-builder\0-v\0--repo=/path/to/workdir\0--force-clean\0appdir\0test.json\0
  bwrap\0--args\012\0./configure\0--prefix=/app\0--some-arg\0
  bwrap\0--args\012\0./configure\0--prefix=/app\0--some-arg\0
  /bin/sh\0./configure\0--prefix=/app\0--some-arg\0

Thus I suppose bwrap is not adding this seccomp filter.
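The argument-order mismatch described above can be sketched in a few lines of C. This is only an illustration, not code from bwrap, flatpak or libseccomp: the function names are hypothetical, and it assumes that on big-endian s390x the word the filter loads at offset 0x14 is the low 32 bits of the first syscall argument. It emulates steps 0025 to 0030 of the dump for a chosen argument index and uses the values from the strace output (child_stack=0x3ff9c874000, flags=0x4111).

```c
#include <assert.h>
#include <stdint.h>

#define CLONE_NEWUSER_BIT 0x10000000u /* value quoted in the filter dump */

/* Hypothetical helper: index of the flags argument of the raw clone syscall.
 * Assumption: 1 on s390/s390x (the stack pointer comes first), 0 elsewhere. */
int clone_flags_arg_index(int is_s390)
{
    return is_s390 ? 1 : 0;
}

/* Emulates steps 0025-0030 of the dump for the argument at arg_index.
 * Returns 1 if the filter would return ERRNO(1), 0 if it would ALLOW. */
int filter_denies_clone(const uint64_t args[6], int arg_index)
{
    assert(arg_index >= 0 && arg_index < 6);

    /* Steps 0025-0027: high word masked with 0, so this never jumps to ALLOW. */
    uint32_t high = (uint32_t)(args[arg_index] >> 32) & 0x0u;
    if (high != 0)
        return 0;

    /* Steps 0028-0030: low word tested for the CLONE_NEWUSER bit. */
    uint32_t low = (uint32_t)(args[arg_index] & 0xffffffffu);
    return (low & CLONE_NEWUSER_BIT) == CLONE_NEWUSER_BIT;
}
```

With args = { 0x3ff9c874000, 0x4111, ... }, checking index 0 tests the stack pointer, whose low word 0x9c874000 happens to have bit 0x10000000 set, so the call is denied. Checking index 1 tests the real flags and allows the call. Since the result at index 0 depends on where the stack was mapped, this might also be why the failure looked like a heisenbug.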
I had a look at /proc/<PID of configure>/cgroup:

  12:cpuset:/
  11:perf_event:/
  10:devices:/user.slice
  9:rdma:/
  8:pids:/user.slice/user-1001.slice/user@1001.service
  7:memory:/user.slice/user-1001.slice/user@1001.service
  6:hugetlb:/
  5:net_cls,net_prio:/
  4:blkio:/user.slice
  3:cpu,cpuacct:/user.slice
  2:freezer:/
  1:name=systemd:/user.slice/user-1001.slice/user@1001.service/flatpak-org.test.Hello2-14224.scope
  0::/user.slice/user-1001.slice/user@1001.service/flatpak-org.test.Hello2-14224.scope

It could be that systemd is applying the seccomp filter, but I don't know how. Can anybody help?

For a test, the seccomp filter could be adjusted to check the second argument of the clone syscall. Of course, for a real patch, the index has to be determined depending on the current architecture.

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1886814

Title:
  posix_spawn usage in gnu make causes failures on s390x

Status in Ubuntu on IBM z Systems: Triaged
Status in glibc package in Ubuntu: New
Status in linux package in Ubuntu: Incomplete
Status in make-dfsg package in Ubuntu: New

Bug description:
  posix_spawn usage in gnu make causes failures on s390x

  Recently in gnu-make v4.3 https://paste.ubuntu.com/p/tYhbJFKN76/ it started to use posix_spawn, instead of fork()/exec().

  This has caused failure of an unrelated package, flatpak-builder, in autopkgtests on s390x only, like so:

    echo Building
    make: echo: Operation not permitted
    make: *** [Makefile:2: all] Error 127

  Julian Klaude investigated this in-depth. His earlier research also indicated that this is a heisenbug: if one tries to print to stderr before printing to stdout, no issue occurs.

  We are configuring GNU make to be built with --disable-posix-spawn on s390x only.

  We passed these details to Debian https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964541 too.
  But I do wonder if there is something different or incorrect about the posix_spawn() implementation in either glibc or the linux kernel on s390x, or about gnu-make's usage of posix_spawn(). Otherwise, using posix_spawn() in gnu-make works on other architectures, and flatpak-builder autopkgtests pass there too.

  It seems very weird that stdout does not appear to be functional unless stderr was opened/written to, when gnu-make is compiled with the posix-spawn feature.