This bug is awaiting verification that the linux-aws/5.4.0-1138.148 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-aws' to 'verification-done-focal-linux-aws'. If the problem still exists, change the tag 'verification-needed-focal-linux-aws' to 'verification-failed-focal-linux-aws'.
If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-focal-linux-aws-v2 verification-needed-focal-linux-aws

https://bugs.launchpad.net/bugs/2085493

Title:
  futex_wait_bitset01, memfd_create0[1-4], copy_file_range0[1-3],
  statx[01-12], membarrier01, io_uring0[1-2], perf_event_open03 from
  ubuntu_ltp_syscalls failed on B-aws-5.4

Status in ubuntu-kernel-tests:
  New

Bug description:

Occasionally, multiple test cases from ubuntu_ltp_syscalls fail on b:aws-5.4 with the same pattern. The following futex_wait_bitset01 failure log is one example.

17:15:00 ERROR| Exception escaping from test:
Traceback (most recent call last):
  File "/home/ubuntu/autotest/client/shared/test.py", line 411, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/home/ubuntu/autotest/client/shared/test.py", line 823, in _call_test_function
    return func(*args, **dargs)
  File "/home/ubuntu/autotest/client/shared/test.py", line 291, in execute
    postprocess_profiled_run, args, dargs)
  File "/home/ubuntu/autotest/client/shared/test.py", line 212, in _call_run_once
    self.run_once(*args, **dargs)
  File "/home/ubuntu/autotest/client/tests/ubuntu_ltp_syscalls/ubuntu_ltp_syscalls.py", line 155, in run_once
    print(utils.system_output(cmd, verbose=False))
  File "/home/ubuntu/autotest/client/shared/utils.py", line 1271, in system_output
    verbose=verbose, args=args).stdout
  File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
    "Command returned non-zero exit status")
CmdError: Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
* Command:
    /opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
Exit status: -7
Duration: 0.00109004974365

17:15:00 INFO | post-test sysinfo error:
17:15:00 INFO | Traceback (most recent call last):
17:15:00 INFO |   File "/home/ubuntu/autotest/client/shared/log.py", line 74, in decorated_func
17:15:00 INFO |     fn(*args, **dargs)
17:15:00 INFO |   File "/home/ubuntu/autotest/client/base_sysinfo.py", line 304, in log_after_each_test
17:15:00 INFO |     keyval = self.log_test_keyvals(test_sysinfodir)
17:15:00 INFO |   File "/home/ubuntu/autotest/client/base_sysinfo.py", line 413, in log_test_keyvals
17:15:00 INFO |     keyval["sysinfo-phys-mbytes"] = utils_memory.rounded_memtotal() // 1024
17:15:00 INFO |   File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 25, in rounded_memtotal
17:15:00 INFO |     usable_kbytes = memtotal()
17:15:00 INFO |   File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 16, in memtotal
17:15:00 INFO |     return read_from_meminfo('MemTotal')
17:15:00 INFO |   File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 10, in read_from_meminfo
17:15:00 INFO |     cmd_result = utils.run('grep %s /proc/meminfo' % key, verbose=False)
17:15:00 INFO |   File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
17:15:00 INFO |     "Command returned non-zero exit status")
17:15:00 INFO | CmdError: Command <grep MemTotal /proc/meminfo> failed, rc=-7, Command returned non-zero exit status
17:15:00 INFO | * Command:
17:15:00 INFO |     grep MemTotal /proc/meminfo
17:15:00 INFO | Exit status: -7
17:15:00 INFO | Duration: 0.00239610671997
17:15:00 INFO |
17:15:00 ERROR| child process failed
17:15:00 DEBUG| Traceback (most recent call last):
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/parallel.py", line 25, in fork_start
17:15:00 DEBUG|     l()
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/job.py", line 505, in <lambda>
17:15:00 DEBUG|     l = lambda: test.runtest(self, url, tag, args, dargs)
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/test.py", line 125, in runtest
17:15:00 DEBUG|     job.sysinfo.log_after_each_iteration)
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/shared/test.py", line 913, in runtest
17:15:00 DEBUG|     mytest._exec(args, dargs)
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/shared/test.py", line 411, in _exec
17:15:00 DEBUG|     _call_test_function(self.execute, *p_args, **p_dargs)
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/shared/test.py", line 823, in _call_test_function
17:15:00 DEBUG|     return func(*args, **dargs)
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/shared/test.py", line 291, in execute
17:15:00 DEBUG|     postprocess_profiled_run, args, dargs)
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/shared/test.py", line 212, in _call_run_once
17:15:00 DEBUG|     self.run_once(*args, **dargs)
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/tests/ubuntu_ltp_syscalls/ubuntu_ltp_syscalls.py", line 155, in run_once
17:15:00 DEBUG|     print(utils.system_output(cmd, verbose=False))
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/shared/utils.py", line 1271, in system_output
17:15:00 DEBUG|     verbose=verbose, args=args).stdout
17:15:00 DEBUG|   File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
17:15:00 DEBUG|     "Command returned non-zero exit status")
17:15:00 DEBUG| CmdError: Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
17:15:00 DEBUG| * Command:
17:15:00 DEBUG|     /opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
17:15:00 DEBUG| Exit status: -7
17:15:00 DEBUG| Duration: 0.00109004974365
17:15:00 DEBUG|
17:15:02 INFO | ERROR ubuntu_ltp_syscalls.futex_wait_bitset01 ubuntu_ltp_syscalls.futex_wait_bitset01 timestamp=1729617302 localtime=Oct 22 17:15:02 Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
* Command:
    /opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
Exit status: -7
Duration: 0.00109004974365

-------------------------------------

[Summary]
- The problem can occur when madvise01/07/11 misbehaves.
- Most likely, it occurs when madvise11 fails to unpoison pages due to a /dev/kmsg overrun.
- For b:aws-5.4, CONFIG_LOG_BUF_SHIFT has been kept at 14 for some reason.
- This is neither a regression nor a critical issue.

[Problem description]
Sometimes all the test cases after futex_wake04 fail (flaky):
- futex_wait_bitset01
- memfd_create01
- memfd_create02
- memfd_create03 [*]
- memfd_create04
- copy_file_range01
- copy_file_range02
- copy_file_range03
- statx01
- statx02
- statx03
- statx04
- statx05
- statx06
- statx07
- statx08
- statx09
- statx10
- statx11
- statx12
- membarrier01
- io_uring01
- io_uring02
- perf_event_open03

[*] Sometimes only memfd_create03 fails.

During futex_wake04, I observed that a bash process was killed with SIGBUS.

* dmesg:
[...]
Oct 22 11:36:20 ip-172-31-21-201 kernel: BTRFS info (device loop7): creating UUID tree
Oct 22 11:38:56 ip-172-31-21-201 kernel: futex_wake04 (21315): drop_caches: 3
> Oct 22 11:38:56 ip-172-31-21-201 AutotestCrashHandler[21322]: Application bash, PID 21321 crashed
> Oct 22 11:38:56 ip-172-31-21-201 AutotestCrashHandler[21322]: Writing core files to ['/home/ubuntu/autotest/client/results/default/ubuntu_ltp_syscalls.futex_wake04/debug/crash.bash.21321']
Oct 22 11:39:47 ip-172-31-21-201 sudo[2141]: pam_unix(sudo:session): session closed for user root
[...]

* where it crashed:

ubuntu@ip-172-31-21-201:~/_bin_bash.0.crash$ gdb -q /bin/bash -ex bt ./CoreDump
[...]
Core was generated by `/bin/bash -c grep MemTotal /proc/meminfo'.
Program terminated with signal SIGBUS, Bus error.
[...]
#0  0x0000ffff996921d4 in elf_machine_rela_relative (reloc_addr_arg=0xaaaae4c443c8 <static_shell_builtins>, reloc=0xaaaae4b60708, l_addr=187650958139392) at ../sysdeps/aarch64/dl-machine.h:376
#1  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=0, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0xffff996b61b0) at do-rel.h:112
#2  _dl_relocate_object (scope=<optimized out>, reloc_mode=<optimized out>, consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:258
#3  0x0000ffff9968ac90 in dl_main (phdr=<optimized out>, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2190
#4  0x0000ffff9969bd70 in _dl_sysdep_start (start_argptr=start_argptr@entry=0xffffff7b4870, dl_main=dl_main@entry=0xffff996890a0 <dl_main>) at ../elf/dl-sysdep.c:253
#5  0x0000ffff996888c4 in _dl_start_final (arg=0xffffff7b4870, arg@entry=0xffff996b4e68, info=info@entry=0xffffff7b4400) at rtld.c:414
#6  0x0000ffff99688b48 in _dl_start (arg=0xffff996b4e68) at rtld.c:523
#7  0x0000ffff996881c8 in _start () from /lib/ld-linux-aarch64.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

- source view
```
[glibc:import/2.27-3ubuntu1.6:sysdeps/aarch64/dl-machine.h]
  369 inline void
  370 __attribute__ ((always_inline))
  371 elf_machine_rela_relative (ElfW(Addr) l_addr,
  372                            const ElfW(Rela) *reloc,
  373                            void *const reloc_addr_arg)
  374 {
  375   ElfW(Addr) *const reloc_addr = reloc_addr_arg;
> 376   *reloc_addr = l_addr + reloc->r_addend;
  377 }
```

- asm view
```
> 0xffff996921d4 <_dl_relocate_object+2412>    str x0, [x1, x22]
```

- reg view
```
(excerpt)
x0   0xaaaae4c28b40   # l_addr + reloc->r_addend
x1   0x10a3c8
x22  0xaaaae4b3a000   # l_addr
```

It is quite unusual for the dynamic loader to crash on such an ordinary code path / instruction. From this point onward, the system gradually becomes unstable: new ssh logins fail, journalctl aborts so logs cannot be inspected, and so on.

[Cause]
After digging into this issue, I found that the madvise11 test case fails to unpoison soft-offlined pages. While madvise01 and madvise07 do not unpoison either, the number of poisoned pages left behind by madvise11 is much larger, so it is more likely to cause unexpected process crashes later.

* madvise01 : 10 pfns hard-offlined (via MADV_HWPOISON)
* madvise07 : 1 pfn hard-offlined (via MADV_HWPOISON)
* madvise11 : 800 pfns soft-offlined

The reason madvise11 fails to unpoison is a /dev/kmsg overrun. Before it starts poisoning, madvise11 echoes a marker to /dev/kmsg; after that, the kernel emits one line like the following for each of the 800 soft-offlinings:

[...]
kernel: Soft offlining pfn 0xb58bc4 at process virtual address 0xffff8dd4f000
[...]
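For reference, each of those soft-offlinings is essentially one madvise() call on a mapped page. Here is a minimal sketch of a single step, assuming CAP_SYS_ADMIN and a kernel built with CONFIG_MEMORY_FAILURE; this is an illustration only, not the actual LTP code, which maps a larger region and has fuller error handling:

```
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101   /* uapi value from asm-generic/mman-common.h */
#endif

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, psize, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 1;	/* fault the page in so it is backed by a real pfn */

	/* For this call the kernel logs "Soft offlining pfn 0x... at
	 * process virtual address 0x..." -- one line per offlined page. */
	if (madvise(p, psize, MADV_SOFT_OFFLINE))
		perror("madvise(MADV_SOFT_OFFLINE)");

	munmap(p, psize);
	return 0;
}
```

Each successful call appends one record of roughly 90 bytes to the kernel ring buffer.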
With 800 such lines, /dev/kmsg overruns and the marker is pushed out of the buffer. madvise11 relies on that marker in its cleanup stage to find the pfns to unpoison; because it cannot find the marker, it does not unpoison any poisoned pfn at all. You can see the following line in the console output:

16:33:28 INFO | madvise11.c:363: TINFO: Restore 0 Soft-offlined pages

So there seem to be two problems:

(A) madvise11 fails to unpoison soft-offlined pages.
(B) Neither madvise01 nor madvise07 unpoisons hard-offlined pages.

Here, (A) is more likely to lead to random process crashes with an unexpected SIGBUS. Roughly speaking, about 98% (= 800/(10+1+800)) of the occurrences should have been caused by (A).

[Solution]
Possible solutions are described below.

Regarding (A),
- (a0) Propagate CONFIG_LOG_BUF_SHIFT=18 to arm64 b:aws-5.4
- (a1) Set the kernel parameter `log_buf_len=` to a sufficiently large value
- (a2) Reduce the number of pages to offline (800 -> 400) with a patch against ltp
- (a3) Modify the madvise11 code so that it does not rely on /dev/kmsg (i.e., use pagemap and record the poisoned pfns internally for the cleanup phase; see the sketch at the end of this description)

Regarding (B),
- (b1) Add cleanup (= unpoison) code for hard-offlining (via MADV_HWPOISON)

[Experiment]
I ran the whole ubuntu_ltp_syscalls suite four times each, with and without the patch for (a2).

with this patch:    4 pass, 0 fail
without this patch: 2 pass [*], 2 fail

[*] Even though all the tests but splice07 passed, the system became unstable and some programs crashed randomly (e.g., journalctl).

Of course, just setting log_buf_len to a sufficiently large value would give the same result.

[Why it did not occur previously]
For a1.medium, it seems that it simply happened not to occur. There was this line in the log for s2024.09.02 a1.medium:

01:52:44 INFO | madvise11.c:363: TINFO: Restore 0 Soft-offlined pages

The issue can potentially occur on a1.medium, c7g.xlarge, and m8g.2xlarge.

arm64 instances    | vCPU | log_buf_len | # soft-offlining | Required buf size (*2) | affected |
-------------------+------+-------------+------------------+------------------------+----------+
a1.medium          |    1 | 16KiB       |              320 | 90*320=28800 (>16KiB!) |    o     |
c6g.8xlarge        |   32 | 256KiB (*1) |              800 | 90*800=72000 (<256KiB) |          |
c7g.xlarge [NEW]   |    4 | 16KiB       |              640 | 90*640=57600 (>16KiB!) |    o     |
m8g.2xlarge [NEW]  |    8 | 64KiB (*1)  |              800 | 90*800=72000 (>64KiB!) |    o     |

(*1): CONFIG_LOG_BUF_SHIFT=14, CONFIG_LOG_CPU_MAX_BUF_SHIFT=12. See log_buf_add_cpu() for details on why the buffer is larger than 1<<14.
(*2): One line "kernel: Soft offlining pfn ..." consumes 90 bytes.

[Next Step]
I'll add an UBUNTU SAUCE patch to b:aws-5.4 so that CONFIG_LOG_BUF_SHIFT=18 is properly propagated.

This is neither a critical nor an urgent issue.
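As a concrete illustration of option (a3) above, here is a minimal sketch of how a test could record pfns by itself via /proc/self/pagemap before offlining, and restore them in the cleanup phase without ever touching /dev/kmsg. It assumes root privileges and, for the unpoison step, a kernel with CONFIG_HWPOISON_INJECT and a mounted debugfs; the helper names are hypothetical, not existing LTP code:

```
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Translate a mapped virtual address to its backing pfn using
 * /proc/self/pagemap (bit 63 = page present, bits 0-54 = pfn).
 * Unprivileged readers see pfn 0, so this needs root.
 * Returns 0 on failure or if the page is not present. */
static uint64_t vaddr_to_pfn(void *vaddr)
{
	uint64_t entry = 0;
	long psize = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	/* One 8-byte entry per page, indexed by vaddr / page size. */
	if (pread(fd, &entry, sizeof(entry),
		  (uintptr_t)vaddr / psize * sizeof(entry)) != sizeof(entry))
		entry = 0;
	close(fd);

	if (!(entry & (1ULL << 63)))
		return 0;
	return entry & ((1ULL << 55) - 1);
}

/* Unpoison one pfn through the hwpoison-inject debugfs interface. */
static int unpoison_pfn(uint64_t pfn)
{
	FILE *f = fopen("/sys/kernel/debug/hwpoison/unpoison-pfn", "w");

	if (!f)
		return -1;
	fprintf(f, "%llu\n", (unsigned long long)pfn);
	return fclose(f);
}
```

The idea is that the setup phase would call vaddr_to_pfn() on each page right before MADV_SOFT_OFFLINE and stash the result in an array, and the cleanup phase would iterate over that array calling unpoison_pfn(), so an overrun ring buffer could no longer leave pages poisoned behind.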