Aside from a1.metal and c6g.8xlarge, no other instance types appear to
be affected at the moment (note: other instance sizes for c6g are
untested). The following issues emphasize the need for resolution,
particularly the second point:

* WARN messages and a tainted kernel.
* Fewer usable CPUs than expected for end users (a1.metal: paying for 16 CPUs
but can use only 4; c6g.8xlarge: paying for 32 CPUs but can use only 16).
* Potential issues for userspace programs that rely on online CPU information.
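
To make the last two points concrete, here is a minimal sketch of how a
userspace program might compare the CPU masks that the kernel exports under
/sys/devices/system/cpu. The helper names (parse_cpu_list, read_cpu_masks) are
mine and hypothetical, and the "enabled" file is read only if the running
kernel exposes it; this is an illustration, not something I ran as part of the
testing above.

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Parse a kernel cpulist string such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means an empty mask
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def read_cpu_masks(root=CPU_SYSFS):
    """Return the CPU sets found under `root`, skipping files that are missing."""
    masks = {}
    for name in ("possible", "present", "online", "enabled"):
        path = root / name
        if path.exists():
            masks[name] = parse_cpu_list(path.read_text())
    return masks


if __name__ == "__main__":
    for name, cpus in read_cpu_masks().items():
        print(f"{name}: {len(cpus)} CPUs -> {sorted(cpus)}")
```

A program that sizes its thread pool from one of these masks would be misled
whenever the masks disagree with what is actually usable.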

Given the possibility of encountering other problematic ACPI table
patterns on untested instance types, I'm inclined to pursue option "(c):
set CONFIG_ACPI_HOTPLUG_CPU=n in some way" (see [Solution] section in
description).

I tested a patch (for Ubuntu-aws-6.11.0-1006.6) on:
* a1.metal
* c6g.8xlarge
* a1.medium
* c7g.xlarge
* m8g.2xlarge

The patch resolved the issues on a1.metal and c6g.8xlarge, and did not
appear to introduce any new issues on any of the instance types tested.

-- 
You received this bug notification because you are a member of Canonical
Platform QA Team, which is subscribed to ubuntu-kernel-tests.
https://bugs.launchpad.net/bugs/2088047

Title:
  log_check / kernel_tainted test from ubuntu_boot failed on Oracular
  AWS a1.metal

Status in ubuntu-kernel-tests:
  New

Bug description:
  Found on Oracular/6.11.0-11.11 boot testing on AWS a1.metal instance.
  The relevant console log excerpts:

  -----(snip)-----
  06:55:12 INFO | 2024-11-09T06:51:17.584884+00:00 ip-172-31-6-235 kernel: 
cpuinfo: failed to register hotplug callbacks.
  -----(snip)-----
  06:55:12 INFO | 2024-11-09T06:51:17.584978+00:00 ip-172-31-6-235 kernel: 
------------[ cut here ]------------
  06:55:12 INFO | 2024-11-09T06:51:17.584980+00:00 ip-172-31-6-235 kernel: 
WARNING: CPU: 7 PID: 1 at fs/sysfs/group.c:128 internal_create_group+0xc4/0x380
  06:55:12 INFO | 2024-11-09T06:51:17.584981+00:00 ip-172-31-6-235 kernel: 
Modules linked in:
  06:55:12 INFO | 2024-11-09T06:51:17.584983+00:00 ip-172-31-6-235 kernel: CPU: 
7 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-11-generic #11-Ubuntu
  06:55:12 INFO | 2024-11-09T06:51:17.584984+00:00 ip-172-31-6-235 kernel: 
Hardware name: Amazon EC2 a1.metal/Not Specified, BIOS 1.0 10/16/2017
  06:55:12 INFO | 2024-11-09T06:51:17.584985+00:00 ip-172-31-6-235 kernel: 
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  06:55:12 INFO | 2024-11-09T06:51:17.584987+00:00 ip-172-31-6-235 kernel: pc : 
internal_create_group+0xc4/0x380
  06:55:12 INFO | 2024-11-09T06:51:17.584989+00:00 ip-172-31-6-235 kernel: lr : 
sysfs_create_group+0x24/0x50
  06:55:12 INFO | 2024-11-09T06:51:17.584993+00:00 ip-172-31-6-235 kernel: sp : 
ffff80008009bb90
  06:55:12 INFO | 2024-11-09T06:51:17.584995+00:00 ip-172-31-6-235 kernel: x29: 
ffff80008009bba0 x28: 0000000000000000 x27: ffff19093bd33ca8
  06:55:12 INFO | 2024-11-09T06:51:17.584997+00:00 ip-172-31-6-235 kernel: x26: 
0000000000000000 x25: ffff436d28704000 x24: ffffd59c11b04a88
  06:55:12 INFO | 2024-11-09T06:51:17.584998+00:00 ip-172-31-6-235 kernel: x23: 
0000000000000000 x22: ffffd59c14046768 x21: ffffd59c1362fca8
  06:55:12 INFO | 2024-11-09T06:51:17.585000+00:00 ip-172-31-6-235 kernel: x20: 
0000000000000036 x19: 0000000000000004 x18: ffff800080095060
  06:55:12 INFO | 2024-11-09T06:51:17.585001+00:00 ip-172-31-6-235 kernel: x17: 
0000000000000000 x16: 0000000000000000 x15: 0000000000000000
  06:55:12 INFO | 2024-11-09T06:51:17.585003+00:00 ip-172-31-6-235 kernel: x14: 
0000000000000000 x13: 0000000000000000 x12: 0000000000000000
  06:55:12 INFO | 2024-11-09T06:51:17.585006+00:00 ip-172-31-6-235 kernel: x11: 
0000000000000000 x10: 0000000000000000 x9 : ffffd59c1128fc4c
  06:55:12 INFO | 2024-11-09T06:51:17.585008+00:00 ip-172-31-6-235 kernel: x8 : 
0101010101010101 x7 : 0000000000000000 x6 : 0000000000000000
  06:55:12 INFO | 2024-11-09T06:51:17.585010+00:00 ip-172-31-6-235 kernel: x5 : 
0000000000000000 x4 : 0000000000000000 x3 : ffff1902003fa280
  06:55:12 INFO | 2024-11-09T06:51:17.585011+00:00 ip-172-31-6-235 kernel: x2 : 
ffffd59c12648f88 x1 : 0000000000000000 x0 : 0000000000000000
  06:55:12 INFO | 2024-11-09T06:51:17.585012+00:00 ip-172-31-6-235 kernel: Call 
trace:
  06:55:12 INFO | 2024-11-09T06:51:17.585013+00:00 ip-172-31-6-235 kernel:  
internal_create_group+0xc4/0x380
  06:55:12 INFO | 2024-11-09T06:51:17.585014+00:00 ip-172-31-6-235 kernel:  
sysfs_create_group+0x24/0x50
  06:55:12 INFO | 2024-11-09T06:51:17.585015+00:00 ip-172-31-6-235 kernel:  
topology_add_dev+0x28/0x50
  06:55:12 INFO | 2024-11-09T06:51:17.585016+00:00 ip-172-31-6-235 kernel:  
cpuhp_invoke_callback+0x200/0x780
  06:55:12 INFO | 2024-11-09T06:51:17.585021+00:00 ip-172-31-6-235 kernel:  
cpuhp_issue_call+0x100/0x198
  06:55:12 INFO | 2024-11-09T06:51:17.585023+00:00 ip-172-31-6-235 kernel:  
__cpuhp_setup_state_cpuslocked+0x128/0x330
  06:55:12 INFO | 2024-11-09T06:51:17.585024+00:00 ip-172-31-6-235 kernel:  
__cpuhp_setup_state+0x5c/0xa8
  06:55:12 INFO | 2024-11-09T06:51:17.585025+00:00 ip-172-31-6-235 kernel:  
topology_sysfs_init+0x40/0x78
  06:55:12 INFO | 2024-11-09T06:51:17.585026+00:00 ip-172-31-6-235 kernel:  
do_one_initcall+0x64/0x3a0
  06:55:12 INFO | 2024-11-09T06:51:17.585027+00:00 ip-172-31-6-235 kernel:  
do_initcalls+0x19c/0x210
  06:55:12 INFO | 2024-11-09T06:51:17.585028+00:00 ip-172-31-6-235 kernel:  
kernel_init_freeable+0x18c/0x1e8
  06:55:12 INFO | 2024-11-09T06:51:17.585029+00:00 ip-172-31-6-235 kernel:  
kernel_init+0x3c/0x190
  06:55:12 INFO | 2024-11-09T06:51:17.585031+00:00 ip-172-31-6-235 kernel:  
ret_from_fork+0x10/0x20
  06:55:12 INFO | 2024-11-09T06:51:17.585035+00:00 ip-172-31-6-235 kernel: ---[ 
end trace 0000000000000000 ]---
  06:55:12 INFO | 2024-11-09T06:51:17.585037+00:00 ip-172-31-6-235 kernel: 
sysfs: cannot create duplicate filename '/devices/cache'
  06:55:12 INFO | 2024-11-09T06:51:17.585038+00:00 ip-172-31-6-235 kernel: CPU: 
5 UID: 0 PID: 47 Comm: cpuhp/5 Tainted: G        W          6.11.0-11-generic 
#11-Ubuntu
  06:55:12 INFO | 2024-11-09T06:51:17.585039+00:00 ip-172-31-6-235 kernel: 
Tainted: [W]=WARN
  06:55:12 INFO | 2024-11-09T06:51:17.585040+00:00 ip-172-31-6-235 kernel: 
Hardware name: Amazon EC2 a1.metal/Not Specified, BIOS 1.0 10/16/2017
  06:55:12 INFO | 2024-11-09T06:51:17.585041+00:00 ip-172-31-6-235 kernel: Call 
trace:
  06:55:12 INFO | 2024-11-09T06:51:17.585146+00:00 ip-172-31-6-235 kernel:  
dump_backtrace+0x104/0x160
  06:55:12 INFO | 2024-11-09T06:51:17.585149+00:00 ip-172-31-6-235 kernel:  
show_stack+0x24/0x50
  06:55:12 INFO | 2024-11-09T06:51:17.585150+00:00 ip-172-31-6-235 kernel:  
dump_stack_lvl+0x84/0xc0
  06:55:12 INFO | 2024-11-09T06:51:17.585155+00:00 ip-172-31-6-235 kernel:  
dump_stack+0x1c/0x40
  06:55:12 INFO | 2024-11-09T06:51:17.585191+00:00 ip-172-31-6-235 kernel:  
sysfs_warn_dup+0xa8/0xf0
  06:55:12 INFO | 2024-11-09T06:51:17.585193+00:00 ip-172-31-6-235 kernel:  
sysfs_create_dir_ns+0x124/0x150
  06:55:12 INFO | 2024-11-09T06:51:17.585194+00:00 ip-172-31-6-235 kernel:  
create_dir+0x30/0x120
  06:55:12 INFO | 2024-11-09T06:51:17.585215+00:00 ip-172-31-6-235 kernel:  
kobject_add_internal+0x90/0x240
  06:55:12 INFO | 2024-11-09T06:51:17.585218+00:00 ip-172-31-6-235 kernel:  
kobject_add+0xa0/0x140
  06:55:12 INFO | 2024-11-09T06:51:17.585234+00:00 ip-172-31-6-235 kernel:  
device_add+0xd8/0x748
  06:55:12 INFO | 2024-11-09T06:51:17.585236+00:00 ip-172-31-6-235 kernel:  
cpu_device_create+0x19c/0x1c0
  06:55:12 INFO | 2024-11-09T06:51:17.585238+00:00 ip-172-31-6-235 kernel:  
cache_add_dev+0x84/0x428
  06:55:12 INFO | 2024-11-09T06:51:17.585252+00:00 ip-172-31-6-235 kernel:  
cacheinfo_cpu_online+0x90/0x138
  06:55:12 INFO | 2024-11-09T06:51:17.585254+00:00 ip-172-31-6-235 kernel:  
cpuhp_invoke_callback+0x200/0x780
  06:55:12 INFO | 2024-11-09T06:51:17.585256+00:00 ip-172-31-6-235 kernel:  
cpuhp_thread_fun+0x140/0x358
  06:55:12 INFO | 2024-11-09T06:51:17.585281+00:00 ip-172-31-6-235 kernel:  
smpboot_thread_fn+0x224/0x250
  06:55:12 INFO | 2024-11-09T06:51:17.585287+00:00 ip-172-31-6-235 kernel:  
kthread+0xf4/0x108
  06:55:12 INFO | 2024-11-09T06:51:17.585289+00:00 ip-172-31-6-235 kernel:  
ret_from_fork+0x10/0x20
  06:55:12 INFO | 2024-11-09T06:51:17.585299+00:00 ip-172-31-6-235 kernel: 
kobject: kobject_add_internal failed for cache with -EEXIST, don't try to 
register things with the same name in the same directory.
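
  The "Tainted: [W]=WARN" line above corresponds to the W flag in the kernel
  taint mask. As a minimal sketch (not the actual kernel_tainted test from
  ubuntu_boot, whose implementation is not shown here), the flag can be checked
  from /proc/sys/kernel/tainted, where W is bit 9 (value 512). The function
  names are hypothetical.

```python
TAINT_WARN_BIT = 9  # 'W' flag: the kernel has issued a warning (mask value 512)


def is_warn_tainted(taint_value):
    """Return True if the W (WARN) bit is set in a taint mask."""
    return bool((taint_value >> TAINT_WARN_BIT) & 1)


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint mask as an integer (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    value = read_taint()
    print(f"taint mask: {value}, WARN bit set: {is_warn_tainted(value)}")
```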

  This was also observed on 6.11.0-1004-aws and 6.11.0-1005-aws.
  Note that Noble is not affected. See the [Affected Versions] section for
  more details.

  -------------------------------------

  [Summary]

    - This is not a regression; it is caused by a problematic ACPI table on
      a1.metal.
    - If the ACPI table is not fixed soon, adding a workaround, at least in
      our tree, might be an option. See the [Solution] section for details.

  [Cause]

    According to the warning messages, the following two registrations fail:
    * cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "arm64/cpuinfo:online",
                        cpuid_cpu_online, cpuid_cpu_offline)
    * cpuhp_setup_state(CPUHP_AP_BASE_CACHEINFO_ONLINE, "base/cacheinfo:online",
                        cacheinfo_cpu_online, cacheinfo_cpu_pre_down)

    Note that other cpuhp callbacks are failing as well. Boot-time tracing of
    cpuhp:* events reveals them:

    4)               |  /* cpuhp_enter: cpu: 0004 target: 238 step: 199 (cpu_capacity_sysctl_add) */
    4)               |  /* cpuhp_exit:  cpu: 0004  state: 238 step: 199 ret: -2 */

    4)               |  /* cpuhp_enter: cpu: 0004 target: 238 step: 199 (cpuid_cpu_online) */
    4)               |  /* cpuhp_exit:  cpu: 0004  state: 238 step: 199 ret: -19 */

    5)               |  /* cpuhp_enter: cpu: 0004 target: 238 step:  54 (topology_add_dev) */
    5)               |  /* cpuhp_exit:  cpu: 0004  state: 238 step:  54 ret: -22 */

    5)               |  /* cpuhp_enter: cpu: 0005 target: 238 step: 193 (cacheinfo_cpu_online) */
    5)               |  /* cpuhp_exit:  cpu: 0005  state: 238 step: 193 ret: -17 */
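
    The return values in this trace are negative errno codes. As a rough
    sketch, a trace like this can be scanned for failing callbacks and the
    codes decoded with Python's errno module. The regular expressions are
    written against the lines quoted here only, and the function name is
    hypothetical.

```python
import errno
import re

# Patterns written against the cpuhp trace lines quoted in this report.
ENTER_RE = re.compile(
    r"cpuhp_enter:\s+cpu:\s+(\d+)\s+target:\s+\d+\s+step:\s+\d+\s+\((\w+)\)")
EXIT_RE = re.compile(
    r"cpuhp_exit:\s+cpu:\s+(\d+)\s+state:\s+\d+\s+step:\s+\d+\s+ret:\s+(-?\d+)")


def failed_callbacks(trace_text):
    """Return (cpu, callback, ret, errno_name) for each failing cpuhp step."""
    pending = {}  # cpu -> callback name from the most recent cpuhp_enter
    failures = []
    for line in trace_text.splitlines():
        m = ENTER_RE.search(line)
        if m:
            pending[int(m.group(1))] = m.group(2)
            continue
        m = EXIT_RE.search(line)
        if m:
            cpu, ret = int(m.group(1)), int(m.group(2))
            callback = pending.pop(cpu, "?")
            if ret < 0:
                failures.append(
                    (cpu, callback, ret, errno.errorcode.get(-ret, "unknown")))
    return failures


SAMPLE = """\
/* cpuhp_enter: cpu: 0004 target: 238 step: 199 (cpu_capacity_sysctl_add) */
/* cpuhp_exit:  cpu: 0004  state: 238 step: 199 ret: -2 */
/* cpuhp_enter: cpu: 0004 target: 238 step: 199 (cpuid_cpu_online) */
/* cpuhp_exit:  cpu: 0004  state: 238 step: 199 ret: -19 */
/* cpuhp_enter: cpu: 0004 target: 238 step:  54 (topology_add_dev) */
/* cpuhp_exit:  cpu: 0004  state: 238 step:  54 ret: -22 */
/* cpuhp_enter: cpu: 0005 target: 238 step: 193 (cacheinfo_cpu_online) */
/* cpuhp_exit:  cpu: 0005  state: 238 step: 193 ret: -17 */
"""

if __name__ == "__main__":
    for cpu, callback, ret, name in failed_callbacks(SAMPLE):
        print(f"CPU{cpu}: {callback} returned {ret} ({name})")
```

    For the four pairs above this gives ENOENT, ENODEV, EINVAL and EEXIST; the
    last one agrees with the "-EEXIST" in the kobject message in the console
    log.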

    These failures occur because CPU#4-15 are not enabled, even though they are
    in cpu_possible_mask and online. The underlying issue is that
    acpi_get_phys_id() fails to get the phys_id for the processor devices
    (CPU#4-15) because of discrepancies in the ACPI table.

      -> acpi_processor_get_info
        -> acpi_get_phys_id
          -> map_mat_entry
          -> map_madt_entry

    Processor Device _UIDs are sequential numbers starting from 0, while
    Processor UIDs in MADT/PPTT are non-sequential (0x0, 0x1, 0x2, 0x3, 0x100,
    0x101, 0x102, 0x103, 0x200, 0x201, ...).
    This results in the map_madt_entry() failure for CPU#4-15.
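
    The mismatch can be illustrated with a small model. This is not the
    kernel's map_madt_entry() code, only a set lookup that mimics "find a MADT
    entry whose UID equals the device _UID". The MADT list contains only the
    values quoted above (the elided remainder is left out), and the device
    count of 16 is the a1.metal CPU count.

```python
# Processor UIDs from MADT/PPTT, exactly as quoted in this report.
# The report elides the rest of the list, so it is left out here as well.
MADT_UIDS = [0x0, 0x1, 0x2, 0x3, 0x100, 0x101, 0x102, 0x103, 0x200, 0x201]

# Processor Device _UIDs: sequential from 0 (a1.metal has 16 CPUs).
DEVICE_UIDS = list(range(16))


def split_by_madt_match(device_uids, madt_uids):
    """Split device _UIDs into those with and without an equal MADT UID."""
    known = set(madt_uids)
    matched = [uid for uid in device_uids if uid in known]
    unmatched = [uid for uid in device_uids if uid not in known]
    return matched, unmatched


if __name__ == "__main__":
    matched, unmatched = split_by_madt_match(DEVICE_UIDS, MADT_UIDS)
    print("matched:  ", matched)
    print("unmatched:", unmatched)
```

    With these inputs only _UIDs 0-3 find a match, which lines up with only 4
    of the 16 CPUs being usable.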

  [Affected Versions]

    * All Oracular kernels are affected at the moment.
    * All Noble kernels are not affected at the moment.

    This is because only Oracular sets CONFIG_ACPI_HOTPLUG_CPU=y, as a result
    of two upstream commits:
      9d0873892f4d ("arm64: Kconfig: Enable hotplug CPU on arm64 if ACPI_PROCESSOR is enabled.")
      46800e38ef0e ("arm64: Kconfig: Fix dependencies to enable ACPI_HOTPLUG_CPU")
    both of which are included in its master kernel.
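
    To check which setting a given kernel was built with, a minimal sketch is
    to parse the config file that Ubuntu kernels install as
    /boot/config-<release>. The helper names are hypothetical, and an option
    absent from the file is reported as "n" here, which is an assumption of
    this sketch.

```python
import platform
from pathlib import Path

NOT_SET_SUFFIX = " is not set"


def parse_kconfig(text):
    """Parse kernel config text into {option: value}; unset options map to "n"."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET_SUFFIX):
            # "# CONFIG_FOO is not set" -> CONFIG_FOO
            options[line[2:-len(NOT_SET_SUFFIX)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def running_kernel_option(name, boot=Path("/boot")):
    """Return the value of `name` for the running kernel, or None if no config file."""
    path = boot / f"config-{platform.release()}"
    if not path.exists():
        return None
    return parse_kconfig(path.read_text()).get(name, "n")


if __name__ == "__main__":
    print("CONFIG_ACPI_HOTPLUG_CPU =",
          running_kernel_option("CONFIG_ACPI_HOTPLUG_CPU"))
```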

  [Solution]

    There are some options:

    (a). override ACPI table (while waiting for firmware update)
    (b). apply a workaround patch for o:aws
    (c). set CONFIG_ACPI_HOTPLUG_CPU=n in some way

  [Experiment]

    Regarding (b), I put together a workaround patch (a dirty hack) and
    confirmed that acpi_processor_get_info() now succeeds for all of CPU#4-15
    and that the warning messages disappear. See the attachment.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/2088047/+subscriptions


-- 
Mailing list: https://launchpad.net/~canonical-ubuntu-qa
Post to     : canonical-ubuntu-qa@lists.launchpad.net
Unsubscribe : https://launchpad.net/~canonical-ubuntu-qa
More help   : https://help.launchpad.net/ListHelp
