On Wed, 2024-09-18 at 18:17 +0200, Gregor Riepl wrote:
> My first attempt at bisecting ran into lots of compilation issues with the
> default config of each version and gcc 14.
> All the 4.x and 5.x kernels fail with the following errors (at least, some
> versions have more):
>
> arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
> arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes
> from a region of size 0 [-Werror=stringop-overread]
> 646 | if (!strcmp(names + ep[ret].name_offset, name))
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source
> object 'mdesc' of size 16
> 78 | struct mdesc_hdr mdesc;
> | ^~~~~
> ...
> In function 'kernel_lds_init',
> inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
> arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array
> bounds of 'char[]' [-Werror=array-bounds=]
> 3102 | data_resource.end = compute_kern_paddr(_edata - 1);
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ./include/asm-generic/sections.h: In function 'report_memory':
> ./include/asm-generic/sections.h:36:32: note: at offset -1 into object
> '_edata' of size [0, 9223372036854775807]
> 36 | extern char _data[], _sdata[], _edata[];
> | ^~~~~~
> ...

Yeah, a lot of these warnings have since been fixed in the kernel. They are
treated as errors if CONFIG_WERROR is set.

> Next issue: The default kernel config lacks some essential drivers to make my
> system bootable. For my Fire V215,
> at least CONFIG_FUSIONMPT and CONFIG_CGROUPS are needed, plus a few other
> things. systemd requires cgroups v2
> support these days.

The default configs for 32-bit and 64-bit SPARC could probably see an update
here.

> I started off with a default config in the first bisect step (corresponding
> with 5.14), added the required options,
> and then did a make oldconfig in each subsequent step, answering all
> questions with the default.

"make localmodconfig" is probably easier in this case.

> Building with make bindeb-pkg produces an almost usable kernel package. For
> some reason, grub-ieee1275 requires an
> unpacked kernel, so the installed vmlinuz needed to be gunzipped afterwards.

That's not an arbitrary reason, but simply a requirement for GRUB on SPARC due
to size limitations. It's documented in the GRUB manual.

> Now for the actual testing... triggering a panic/oops reliably was difficult.
> The Debian 6.10 kernel usually crashes
> relatively quickly on disk I/O, and enabling swap accelerates the effect.
> bonnie++ should therefore make for a good
> stress test.

Unfortunately, I haven't found a good reproducer yet either.

> I don't have the exact commit IDs of each bisection step, but it was
> (roughly) 5.14-rc6, 6.6-rc7, 6.8-rc3, 6.9, 6.10.
>
> There were a few odd non-critical issues, such as this I/O error with 5.14
> (but nothing in dmesg):
>
> $ /usr/sbin/bonnie++
> Writing a byte at a time...done
> Writing intelligently...done
> Rewriting...Can't write block.: Unknown error 2560
> Bonnie: drastic I/O error (re write(2)): Unknown error 2560

Just use "git bisect skip" in this case to skip unrelated regressions.

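If the build and the test can be scripted, the skipping can also be left to
"git bisect run", which treats exit code 125 as "skip this commit", 0 as good
and any other code from 1 to 127 as bad. A minimal sketch of such a helper
follows. The script name and the function names are made up for illustration,
and it assumes the build and test commands are passed in as two shell strings:

```python
#!/usr/bin/env python3
"""Hypothetical helper for "git bisect run"; all names here are made up."""
import subprocess
import sys

# Exit codes that "git bisect run" understands.
GOOD, BAD, SKIP = 0, 1, 125


def verdict(build_ok, test_ok):
    """Map the outcome of one bisection step to a "git bisect run" exit code."""
    if not build_ok:
        # The commit cannot be built (unrelated breakage): skip it.
        return SKIP
    return GOOD if test_ok else BAD


def run(cmd):
    """Run a shell command and report whether it exited with status 0."""
    return subprocess.run(cmd, shell=True).returncode == 0


def step(build_cmd, test_cmd):
    """Build first; run the test only if the build succeeded."""
    build_ok = run(build_cmd)
    test_ok = build_ok and run(test_cmd)
    return verdict(build_ok, test_ok)


# Only act when called with exactly two arguments: build command, test command.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(step(sys.argv[1], sys.argv[2]))
```

It would be called along the lines of
git bisect run ./bisect_step.py "<build command>" "<test command>",
where the test command has to be something that can tell a good kernel from a
bad one without manual intervention.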
> 6.2 produces this warning at boot:
>
> [ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ +1.422401] rcu: 0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19
> fqs=44
> [ +0.093960] (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
> [ +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
> [ +0.083641] TSTATE: 0000004411001605 TPC: 000000000042beac TNPC:
> 000000000042beb0 Y: 00000000 Not tainted
> [ +0.129479] TPC: <arch_cpu_idle+0x8c/0xa0>
> [ +0.053848] g0: 00000000004209d0 g1: 00000000015282c0 g2: 00000000015105c8
> g3: 0000000000000001
> [ +0.114585] g4: fff0000000390ba0 g5: fff000027e2f0000 g6: fff0000000398000
> g7: 00000000173aa294
> [ +0.114582] o0: fff0000000390ba0 o1: 0000000000000001 o2: 000000000130ae78
> o3: 00000000015105c8
> [ +0.114580] o4: 00000000015280c0 o5: 000000000130b580 sp: fff000000039b3d1
> ret_pc: 000000000042bea0
> [ +0.119164] RPC: <arch_cpu_idle+0x80/0xa0>
> [ +0.053850] l0: 0000000001407f20 l1: 0000000000022c05 l2: 0000000000000000
> l3: 000000000130b538
> [ +0.114585] l4: 000000000130b400 l5: 0000000000000040 l6: 0000000000000000
> l7: 0000000001408140
> [ +0.114581] i0: 00000000173aa299 i1: fff000027f814990 i2: 0000000000000001
> i3: 0000000000000001
> [ +0.114580] i4: fff000027f814990 i5: 0000000001524990 i6: fff000000039b481
> i7: 0000000000b22f68
> [ +0.114582] I7: <default_idle_call+0x48/0x100>
> [ +0.058433] Call Trace:
> [ +0.032082] [<0000000000b22f68>] default_idle_call+0x48/0x100
> [ +0.075624] [<00000000004adc28>] do_idle+0x108/0x180
> [ +0.065311] [<00000000004adf34>] cpu_startup_entry+0x14/0x40
> [ +0.074477] [<000000000043ede4>] smp_callin+0xe4/0x120
> [ +0.067603] [<0000000001318614>] 0x1318614
> [ +0.053853] [<0000000040000000>] 0x40000000

FWIW, it could be an idea to run the RCU torture test as a check for bisecting.

See: https://docs.kernel.org/RCU/torture.html

From what I know, there are a number of hidden bugs in the RCU implementation
on some architectures.

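To turn an rcutorture run into a pass/fail check for a bisection step, the
summary line it prints to the kernel log could be parsed. The following is only
a rough sketch: the function names are made up, and the "End of test" line
format in the regular expression is an assumption based on the documentation
linked above, so it should be checked against real dmesg output first.

```python
import re

# Assumed format of the final rcutorture summary line in dmesg, e.g.
#   rcu-torture:--- End of test: SUCCESS: nreaders=...
# Verify this against the actual kernel log before relying on it.
END_RE = re.compile(r"-torture:--- End of test: (\w+)")


def rcutorture_verdict(dmesg_text):
    """Return the last "End of test" verdict found in dmesg output, or None."""
    matches = END_RE.findall(dmesg_text)
    return matches[-1] if matches else None


def rcutorture_passed(dmesg_text):
    """True only if a summary line was found and it reports SUCCESS."""
    return rcutorture_verdict(dmesg_text) == "SUCCESS"
```

With CONFIG_RCU_TORTURE_TEST=m, the test itself would be run by loading the
rcutorture module, waiting a while, unloading it again and then feeding the
output of dmesg to the function above.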
> It also failed to shut down properly:
>
> [ 1634.268777] systemd-journald[181]: Failed to send WATCHDOG=1 notification
> message: Connection refused
> [ 1754.268963] systemd-journald[181]: Failed to send WATCHDOG=1 notification
> message: Transport endpoint is not connected
>
> The shutdown got stuck after that. I did not see this with any other kernels.
>
> From 6.2 onward, the tg3 network driver produces this warning at shutdown
> (but it proceeds from there without issue):
>
> [ 1594.751376] ------------[ cut here ]------------
> [ 1594.812280] WARNING: CPU: 0 PID: 3914 at kernel/irq/msi.c:196
> msi_domain_free_descs+0xdc/0x100
> [ 1594.925813] Modules linked in: binfmt_misc flash sg fuse autofs4 dm_mod
> mptsas sr_mod scsi_transport_sas mptscsih ehci_pci mptbase tg3 cdrom ehci_hcd
> libphy
> [ 1595.110450] CPU: 0 PID: 3914 Comm: ip Not tainted 6.2.0-rc7+ #18
> [ 1595.189586] Call Trace:
> [ 1595.221667] [<0000000000465da8>] __warn+0xe8/0x120
> [ 1595.284686] [<0000000000b11088>] warn_slowpath_fmt+0x30/0x70
> [ 1595.359165] [<00000000004cdbfc>] msi_domain_free_descs+0xdc/0x100
> [ 1595.439371] [<00000000004ce878>] msi_domain_free_msi_descs_range+0x18/0x40
> [ 1595.529891] [<0000000000819984>] pci_free_msi_irqs+0x4/0x20
> [ 1595.603222] [<0000000000817e94>] pci_disable_msi+0x54/0x80
> [ 1595.675408] [<00000000100b0464>] tg3_ints_fini+0x64/0xe0 [tg3]
> [ 1595.752282] [<00000000100c880c>] tg3_stop+0x22c/0x2c0 [tg3]
> [ 1595.825614] [<00000000100c88c0>] tg3_close+0x20/0xa0 [tg3]
> [ 1595.897799] [<000000000096c8e8>] __dev_close_many+0x88/0x100
> [ 1595.972278] [<0000000000976c64>] __dev_change_flags+0xa4/0x1e0
> [ 1596.049047] [<0000000000976db8>] dev_change_flags+0x18/0x60
> [ 1596.122378] [<00000000009872a0>] do_setlink+0x2e0/0x1140
> [ 1596.192273] [<000000000098d138>] __rtnl_newlink+0x3f8/0x7e0
> [ 1596.265605] [<000000000098d550>] rtnl_newlink+0x30/0x60
> [ 1596.334353] [<0000000000986a7c>] rtnetlink_rcv_msg+0x27c/0x360
> [ 1596.411144] ---[ end trace 0000000000000000 ]---
>
> On 6.6, I got this warning at boot:
>
> [ +21.089612] rcu: INFO: rcu_sched self-detected stall on CPU
> [ +0.000007] rcu: 1-....: (281 ticks this GP)
> idle=36cc/1/0x4000000000000002 softirq=28/28 fqs=1050
> [ +0.000012] rcu: (t=2101 jiffies g=-1175 q=1029 ncpus=2)
> [ +0.000007] CPU: 1 PID: 1 Comm: swapper/1 Not tainted 6.6.0-rc7+ #19
> [ +0.000008] TSTATE: 0000004411001602 TPC: 00000000004c23f0 TNPC:
> 00000000004c23f4 Y: 00001f91 Not tainted
> [ +0.000005] TPC: <console_flush_all+0x1d0/0x4a0>
> [ +0.000018] g0: 00000000004c23f0 g1: 000000000154bca0 g2: 0000000000000000
> g3: 00000000016e1400
> [ +0.000004] g4: fff0001004510000 g5: fff000103d2b6000 g6: fff0001004658000
> g7: 000000000000000e
> [ +0.000004] o0: 00000000016e17f8 o1: 0000000000000000 o2: 0000000000000000
> o3: 000000000000004d
> [ +0.000004] o4: 00000000016e0bd8 o5: 0000000001753250 sp: fff000100465a9c1
> ret_pc: 00000000004c23e4
> [ +0.000004] RPC: <console_flush_all+0x1c4/0x4a0>
> [ +0.000007] l0: 000000000133b078 l1: 0000000000000000 l2: 0000000000000000
> l3: 0000000000000000
> [ +0.000004] l4: 0000000001435400 l5: 0000000000000000 l6: 00000000016e0bd8
> l7: 00000000014b0840
> [ +0.000004] i0: 0000000000000000 i1: fff000100465b368 i2: fff000100465b367
> i3: 00000000016e1400
> [ +0.000004] i4: 00000000016e0bd8 i5: 00000000016e17f8 i6: fff000100465aab1
> i7: 00000000004c2730
> [ +0.000004] I7: <console_unlock+0x70/0xe0>
> [ +0.000008] Call Trace:
> [ +0.000003] [<00000000004c2730>] console_unlock+0x70/0xe0
> [ +0.000007] [<00000000004c3c8c>] vprintk_emit+0x1cc/0x220
> [ +0.000009] [<0000000000b32aa4>] _printk+0x24/0x34
> [ +0.000014] [<00000000008851e8>] serial_core_register_port+0x468/0x6c0
> [ +0.000007] [<0000000000888998>] su_probe+0x178/0x3c0
> [ +0.000009] [<0000000000898fe8>] platform_probe+0x28/0x80
> [ +0.000006] [<0000000000896bf8>] really_probe+0xb8/0x2e0
> [ +0.000011] [<0000000000896f04>] driver_probe_device+0x24/0xe0
> [ +0.000007] [<0000000000897104>] __driver_attach+0x64/0x120
> [ +0.000007] [<0000000000894c10>] bus_for_each_dev+0x50/0xa0
> [ +0.000007] [<0000000000895d3c>] bus_add_driver+0x17c/0x1e0
> [ +0.000006] [<00000000008979d4>] driver_register+0x74/0x120
> [ +0.000008] [<000000000151ab90>] sunsu_init+0x170/0x1d4
> [ +0.000009] [<0000000000427bf4>] do_one_initcall+0x34/0x220
> [ +0.000008] [<00000000014f8fb4>] kernel_init_freeable+0x210/0x274
> [ +0.000012] [<0000000000b3c1bc>] kernel_init+0x18/0x13c
>
> On 6.6, I also found these messages in the kernel log (but apparently no
> negative consequences):
>
> [ +0.371437] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
> [ +0.091825] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
> [ +0.091734] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
> [ +0.091763] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
> [ +0.091757] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
> [ +0.252176] log_unaligned: 4200 callbacks suppressed
> [ +0.055120] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
> [ +0.000023] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
> [ +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
> [ +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
>
>
> Conclusion:
>
> It looks very much like it isn't specifically a kernel bug at all, but either
> something
> wrong with the Debian kernel config, or with newer gcc versions.

I still think it's a kernel bug.

> I will test some other gcc versions next.
>
> Unfortunately, I couldn't test the config from the Debian
> linux-image-6.10.7-sparc64-smp package.
> Trying to build a kernel with this config produced a 700MB package, and the
> resulting initrd was
> too large to fit into my boot partition. Is there something special about how
> Debian builds kernel packages?

It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds the
kernel with debug symbols enabled and then runs the strip command afterwards.
This way both a debug and a standard kernel package can be provided from the
same build.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913