Aside from a1.metal and c6g.8xlarge, no other instance types appear to be affected at the moment (note: other instance sizes for c6g are untested). The following issues emphasize the need for resolution, particularly the second point:
* WARN messages and kernel taint.
* Fewer usable CPUs than expected for end users (a1.metal: paying for 16 CPUs but can use only 4; c6g.8xlarge: paying for 32 CPUs but can use only 16).
* Potential issues for userspace programs that rely on online CPU information.

Given the possibility of encountering other problematic ACPI table patterns on untested instance types, I'm inclined to pursue option "(c): set CONFIG_ACPI_HOTPLUG_CPU=n in some way" (see the [Solution] section in the description).

I tested a patch (for Ubuntu-aws-6.11.0-1006.6) on:
* a1.metal
* c6g.8xlarge
* a1.medium
* c7g.xlarge
* m8g.2xlarge

The patch resolved the issues on a1.metal and c6g.8xlarge and did not appear to introduce any new issues on any of the tested instance types.

--
You received this bug notification because you are a member of Canonical Platform QA Team, which is subscribed to ubuntu-kernel-tests.
https://bugs.launchpad.net/bugs/2088047

Title:
log_check / kernel_tainted test from ubuntu_boot failed on Oracular AWS a1.metal

Status in ubuntu-kernel-tests:
New

Bug description:
Found on Oracular/6.11.0-11.11 boot testing on AWS a1.metal instance.

The relevant console log excerpts:

-----(snip)-----
06:55:12 INFO | 2024-11-09T06:51:17.584884+00:00 ip-172-31-6-235 kernel: cpuinfo: failed to register hotplug callbacks.
-----(snip)-----
06:55:12 INFO | 2024-11-09T06:51:17.584978+00:00 ip-172-31-6-235 kernel: ------------[ cut here ]------------
06:55:12 INFO | 2024-11-09T06:51:17.584980+00:00 ip-172-31-6-235 kernel: WARNING: CPU: 7 PID: 1 at fs/sysfs/group.c:128 internal_create_group+0xc4/0x380
06:55:12 INFO | 2024-11-09T06:51:17.584981+00:00 ip-172-31-6-235 kernel: Modules linked in:
06:55:12 INFO | 2024-11-09T06:51:17.584983+00:00 ip-172-31-6-235 kernel: CPU: 7 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-11-generic #11-Ubuntu
06:55:12 INFO | 2024-11-09T06:51:17.584984+00:00 ip-172-31-6-235 kernel: Hardware name: Amazon EC2 a1.metal/Not Specified, BIOS 1.0 10/16/2017
06:55:12 INFO | 2024-11-09T06:51:17.584985+00:00 ip-172-31-6-235 kernel: pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
06:55:12 INFO | 2024-11-09T06:51:17.584987+00:00 ip-172-31-6-235 kernel: pc : internal_create_group+0xc4/0x380
06:55:12 INFO | 2024-11-09T06:51:17.584989+00:00 ip-172-31-6-235 kernel: lr : sysfs_create_group+0x24/0x50
06:55:12 INFO | 2024-11-09T06:51:17.584993+00:00 ip-172-31-6-235 kernel: sp : ffff80008009bb90
06:55:12 INFO | 2024-11-09T06:51:17.584995+00:00 ip-172-31-6-235 kernel: x29: ffff80008009bba0 x28: 0000000000000000 x27: ffff19093bd33ca8
06:55:12 INFO | 2024-11-09T06:51:17.584997+00:00 ip-172-31-6-235 kernel: x26: 0000000000000000 x25: ffff436d28704000 x24: ffffd59c11b04a88
06:55:12 INFO | 2024-11-09T06:51:17.584998+00:00 ip-172-31-6-235 kernel: x23: 0000000000000000 x22: ffffd59c14046768 x21: ffffd59c1362fca8
06:55:12 INFO | 2024-11-09T06:51:17.585000+00:00 ip-172-31-6-235 kernel: x20: 0000000000000036 x19: 0000000000000004 x18: ffff800080095060
06:55:12 INFO | 2024-11-09T06:51:17.585001+00:00 ip-172-31-6-235 kernel: x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
06:55:12 INFO | 2024-11-09T06:51:17.585003+00:00 ip-172-31-6-235 kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
06:55:12 INFO | 2024-11-09T06:51:17.585006+00:00 ip-172-31-6-235 kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffffd59c1128fc4c
06:55:12 INFO | 2024-11-09T06:51:17.585008+00:00 ip-172-31-6-235 kernel: x8 : 0101010101010101 x7 : 0000000000000000 x6 : 0000000000000000
06:55:12 INFO | 2024-11-09T06:51:17.585010+00:00 ip-172-31-6-235 kernel: x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff1902003fa280
06:55:12 INFO | 2024-11-09T06:51:17.585011+00:00 ip-172-31-6-235 kernel: x2 : ffffd59c12648f88 x1 : 0000000000000000 x0 : 0000000000000000
06:55:12 INFO | 2024-11-09T06:51:17.585012+00:00 ip-172-31-6-235 kernel: Call trace:
06:55:12 INFO | 2024-11-09T06:51:17.585013+00:00 ip-172-31-6-235 kernel: internal_create_group+0xc4/0x380
06:55:12 INFO | 2024-11-09T06:51:17.585014+00:00 ip-172-31-6-235 kernel: sysfs_create_group+0x24/0x50
06:55:12 INFO | 2024-11-09T06:51:17.585015+00:00 ip-172-31-6-235 kernel: topology_add_dev+0x28/0x50
06:55:12 INFO | 2024-11-09T06:51:17.585016+00:00 ip-172-31-6-235 kernel: cpuhp_invoke_callback+0x200/0x780
06:55:12 INFO | 2024-11-09T06:51:17.585021+00:00 ip-172-31-6-235 kernel: cpuhp_issue_call+0x100/0x198
06:55:12 INFO | 2024-11-09T06:51:17.585023+00:00 ip-172-31-6-235 kernel: __cpuhp_setup_state_cpuslocked+0x128/0x330
06:55:12 INFO | 2024-11-09T06:51:17.585024+00:00 ip-172-31-6-235 kernel: __cpuhp_setup_state+0x5c/0xa8
06:55:12 INFO | 2024-11-09T06:51:17.585025+00:00 ip-172-31-6-235 kernel: topology_sysfs_init+0x40/0x78
06:55:12 INFO | 2024-11-09T06:51:17.585026+00:00 ip-172-31-6-235 kernel: do_one_initcall+0x64/0x3a0
06:55:12 INFO | 2024-11-09T06:51:17.585027+00:00 ip-172-31-6-235 kernel: do_initcalls+0x19c/0x210
06:55:12 INFO | 2024-11-09T06:51:17.585028+00:00 ip-172-31-6-235 kernel: kernel_init_freeable+0x18c/0x1e8
06:55:12 INFO | 2024-11-09T06:51:17.585029+00:00 ip-172-31-6-235 kernel: kernel_init+0x3c/0x190
06:55:12 INFO | 2024-11-09T06:51:17.585031+00:00 ip-172-31-6-235 kernel: ret_from_fork+0x10/0x20
06:55:12 INFO | 2024-11-09T06:51:17.585035+00:00 ip-172-31-6-235 kernel: ---[ end trace 0000000000000000 ]---
06:55:12 INFO | 2024-11-09T06:51:17.585037+00:00 ip-172-31-6-235 kernel: sysfs: cannot create duplicate filename '/devices/cache'
06:55:12 INFO | 2024-11-09T06:51:17.585038+00:00 ip-172-31-6-235 kernel: CPU: 5 UID: 0 PID: 47 Comm: cpuhp/5 Tainted: G W 6.11.0-11-generic #11-Ubuntu
06:55:12 INFO | 2024-11-09T06:51:17.585039+00:00 ip-172-31-6-235 kernel: Tainted: [W]=WARN
06:55:12 INFO | 2024-11-09T06:51:17.585040+00:00 ip-172-31-6-235 kernel: Hardware name: Amazon EC2 a1.metal/Not Specified, BIOS 1.0 10/16/2017
06:55:12 INFO | 2024-11-09T06:51:17.585041+00:00 ip-172-31-6-235 kernel: Call trace:
06:55:12 INFO | 2024-11-09T06:51:17.585146+00:00 ip-172-31-6-235 kernel: dump_backtrace+0x104/0x160
06:55:12 INFO | 2024-11-09T06:51:17.585149+00:00 ip-172-31-6-235 kernel: show_stack+0x24/0x50
06:55:12 INFO | 2024-11-09T06:51:17.585150+00:00 ip-172-31-6-235 kernel: dump_stack_lvl+0x84/0xc0
06:55:12 INFO | 2024-11-09T06:51:17.585155+00:00 ip-172-31-6-235 kernel: dump_stack+0x1c/0x40
06:55:12 INFO | 2024-11-09T06:51:17.585191+00:00 ip-172-31-6-235 kernel: sysfs_warn_dup+0xa8/0xf0
06:55:12 INFO | 2024-11-09T06:51:17.585193+00:00 ip-172-31-6-235 kernel: sysfs_create_dir_ns+0x124/0x150
06:55:12 INFO | 2024-11-09T06:51:17.585194+00:00 ip-172-31-6-235 kernel: create_dir+0x30/0x120
06:55:12 INFO | 2024-11-09T06:51:17.585215+00:00 ip-172-31-6-235 kernel: kobject_add_internal+0x90/0x240
06:55:12 INFO | 2024-11-09T06:51:17.585218+00:00 ip-172-31-6-235 kernel: kobject_add+0xa0/0x140
06:55:12 INFO | 2024-11-09T06:51:17.585234+00:00 ip-172-31-6-235 kernel: device_add+0xd8/0x748
06:55:12 INFO | 2024-11-09T06:51:17.585236+00:00 ip-172-31-6-235 kernel: cpu_device_create+0x19c/0x1c0
06:55:12 INFO | 2024-11-09T06:51:17.585238+00:00 ip-172-31-6-235 kernel: cache_add_dev+0x84/0x428
06:55:12 INFO | 2024-11-09T06:51:17.585252+00:00 ip-172-31-6-235 kernel: cacheinfo_cpu_online+0x90/0x138
06:55:12 INFO | 2024-11-09T06:51:17.585254+00:00 ip-172-31-6-235 kernel: cpuhp_invoke_callback+0x200/0x780
06:55:12 INFO | 2024-11-09T06:51:17.585256+00:00 ip-172-31-6-235 kernel: cpuhp_thread_fun+0x140/0x358
06:55:12 INFO | 2024-11-09T06:51:17.585281+00:00 ip-172-31-6-235 kernel: smpboot_thread_fn+0x224/0x250
06:55:12 INFO | 2024-11-09T06:51:17.585287+00:00 ip-172-31-6-235 kernel: kthread+0xf4/0x108
06:55:12 INFO | 2024-11-09T06:51:17.585289+00:00 ip-172-31-6-235 kernel: ret_from_fork+0x10/0x20
06:55:12 INFO | 2024-11-09T06:51:17.585299+00:00 ip-172-31-6-235 kernel: kobject: kobject_add_internal failed for cache with -EEXIST, don't try to register things with the same name in the same directory.

This was also observed on 6.11.0-1004-aws and 6.11.0-1005-aws.

Note that Noble is not affected. See the [Affected Versions] section for more details.

-------------------------------------

[Summary]

- This is not a regression; it is caused by a problematic ACPI table on a1.metal.
- If the ACPI table is not fixed soon, adding a workaround, at least in our tree, might be an option. See the [Solution] section for more details.

[Cause]

According to the warn messages, the following two calls are failing:

* cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "arm64/cpuinfo:online", cpuid_cpu_online, cpuid_cpu_offline)
* cpuhp_setup_state(CPUHP_AP_BASE_CACHEINFO_ONLINE, "base/cacheinfo:online", cacheinfo_cpu_online, cacheinfo_cpu_pre_down)

Note that other cpuhp callbacks are failing as well.
Boot-time tracing of cpuhp:* events reveals them:

4) | /* cpuhp_enter: cpu: 0004 target: 238 step: 199 (cpu_capacity_sysctl_add) */
4) | /* cpuhp_exit: cpu: 0004 state: 238 step: 199 ret: -2 */
4) | /* cpuhp_enter: cpu: 0004 target: 238 step: 199 (cpuid_cpu_online) */
4) | /* cpuhp_exit: cpu: 0004 state: 238 step: 199 ret: -19 */
5) | /* cpuhp_enter: cpu: 0004 target: 238 step: 54 (topology_add_dev) */
5) | /* cpuhp_exit: cpu: 0004 state: 238 step: 54 ret: -22 */
5) | /* cpuhp_enter: cpu: 0005 target: 238 step: 193 (cacheinfo_cpu_online) */
5) | /* cpuhp_exit: cpu: 0005 state: 238 step: 193 ret: -17 */

These failures occur because CPU#4-15 are not enabled, even though they are in cpu_possible_mask and are online.

The issue is that acpi_get_phys_id() fails to get the phys_id for the processor devices (CPU#4-15) because of discrepancies in the ACPI table.

-> acpi_processor_get_info
-> acpi_get_phys_id
-> map_mat_entry
-> map_madt_entry

Processor Device _UIDs are sequential numbers starting from 0, while Processor UIDs in MADT/PPTT are non-sequential (0x0, 0x1, 0x2, 0x3, 0x100, 0x101, 0x102, 0x103, 0x200, 0x201, ...). As a result, map_madt_entry() finds no matching entry for CPU#4-15 and fails.

[Affected Versions]

* All Oracular kernels are affected at the moment.
* No Noble kernels are affected at the moment.

This is because only Oracular sets CONFIG_ACPI_HOTPLUG_CPU=y, as a result of two upstream commits:

9d0873892f4d ("arm64: Kconfig: Enable hotplug CPU on arm64 if ACPI_PROCESSOR is enabled.")
46800e38ef0e ("arm64: Kconfig: Fix dependencies to enable ACPI_HOTPLUG_CPU")

which are originally included in its master kernel.

[Solution]

There are some options:

(a). override the ACPI table (while waiting for a firmware update)
(b). apply a workaround patch for o:aws
(c). set CONFIG_ACPI_HOTPLUG_CPU=n in some way

[Experiment]

Regarding (b), I cooked up a workaround patch (a dirty hack) and confirmed that acpi_processor_get_info() now succeeds for all of CPU#4-15 and that the warn messages disappear.
See the attached.
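
As an aside on the UID mismatch described in [Cause], the following is a toy model in Python, not kernel code. It only illustrates why a lookup keyed on sequential Processor Device _UIDs would match the first four MADT entries and miss the rest. The helper names (lookup_phys_id, split_by_lookup) are hypothetical, and the MADT UIDs beyond 0x201 are an assumption that the elided list continues the same pattern for 16 CPUs.

```python
# Toy model of the UID lookup described in [Cause]; this is not kernel code.

# UIDs given in the report: 0x0-0x3, 0x100-0x103, 0x200, 0x201, ...
# ASSUMPTION: the elided tail continues the same pattern (4 groups of 4 CPUs).
MADT_UIDS = [(group << 8) | core for group in range(4) for core in range(4)]

# Processor Device _UIDs are sequential, starting from 0 (16 CPUs on a1.metal).
DEVICE_UIDS = list(range(16))


def lookup_phys_id(device_uid, madt_uids):
    """Return a stand-in phys_id if the UID has a MADT entry, else None.

    The list index is only a placeholder for a real hardware ID.
    """
    return madt_uids.index(device_uid) if device_uid in madt_uids else None


def split_by_lookup(device_uids, madt_uids):
    """Partition device UIDs into (matched, unmatched) against the MADT list."""
    matched = [u for u in device_uids if lookup_phys_id(u, madt_uids) is not None]
    unmatched = [u for u in device_uids if lookup_phys_id(u, madt_uids) is None]
    return matched, unmatched


if __name__ == "__main__":
    matched, unmatched = split_by_lookup(DEVICE_UIDS, MADT_UIDS)
    print("matched:  ", matched)
    print("unmatched:", unmatched)
```

Under these assumptions only UIDs 0-3 find a matching entry, which is consistent with 4 usable CPUs out of 16 on a1.metal. The pattern on c6g.8xlarge is not given in the report and is not modelled here.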