--- Begin Message ---
Package: linux-image-amd64
Version: 5.10.178-3
We have a small fleet of Debian servers; some of them are KVM guests.
After an upgrade from 11.6 to 11.7 some of them did not boot up properly
or at all. We narrowed the problem down to (probably) the 5.10.0-22-amd64
kernel, because the servers boot properly with the 5.10.0-21-amd64 kernel.
All misbehaving machines are KVM guests.
Additionally, the problem seems to be related to only one of our
KVM hypervisors (one server), which runs standard Debian Bullseye
(not yet upgraded to 11.7), but with the 6.1.0-0.deb11.5-amd64 kernel
from backports.
A short simple description of the situation:
* What worked: fully updated Debian 11.6 (5.10.0-21-amd64)
* What is not working: fully updated Debian 11.7 (5.10.0-22-amd64)
* What is working: fully updated Debian 11.7 with the previous kernel
(5.10.0-21-amd64)
Expected behaviour: a working fully updated Debian 11.7
with 5.10.0-22-amd64 kernel
In the logs we found a lot of segfaults related to libc/ld. I'm pasting
them below. These are from only one machine; other guests behave
similarly.
Hypervisor details:
* Linux tor 6.1.0-0.deb11.5-amd64 #1 SMP PREEMPT_DYNAMIC Debian
6.1.12-1~bpo11+1 (2023-03-05) x86_64 GNU/Linux
* AMD EPYC 75F3 32-Core Processor
* ii libc6:amd64 2.31-13+deb11u5
amd64 GNU C Library: Shared libraries
Guest libc6 details (KVM guest):
# dpkg -l | grep libc6
ii libc6:amd64 2.31-13+deb11u6 amd64 GNU C Library: Shared libraries
ii libc6-dev:amd64 2.31-13+deb11u6 amd64 GNU C Library: Development Libraries and Header Files
Reviewing all the logs did not show any reason for such behaviour;
segfaults start to appear at different moments without apparent
reason. Switching back to the previous kernel (5.10.0-21-amd64)
resolves the issue. The hypervisor logs also did not show anything that
could suggest what the problem is.
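
The kernel's segfault lines follow a fixed format, so they can be tallied by faulting object and error code to see whether the failures cluster. The following is only a minimal sketch: the regular expression is modelled on the lines quoted in this report, the function name `summarize_segfaults` is made up for this illustration, and it assumes each log entry is on a single unwrapped line (unlike the mail-wrapped excerpts here).

```python
import re
from collections import Counter

# Pattern modelled on the segfault lines quoted in this report, e.g.
# "systemd-udevd[279]: segfault at <addr> ip <ip> sp <sp> error 25 in libc-2.31.so[...]"
# The trailing "in <object>" part is optional because it can be missing or truncated.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>[0-9a-f]+)(?: in (?P<object>[^\[\s]+))?"
)


def summarize_segfaults(lines):
    """Count segfault entries per (faulting object, error code) pair."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            key = (match.group("object") or "?", match.group("error"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): journalctl -k -b | python3 this_script.py
    for (obj, error), n in summarize_segfaults(sys.stdin).most_common():
        print(f"{n:4d}  error {error}  {obj}")
```

Fed with the unwrapped kernel log of a failing boot, this prints one line per object and error code combination, which makes it easier to compare guests.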
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.672812]
(mount)[209]: segfault at 7f7083027068 ip 00007f7082e8c250 sp
00007ffc276302d8 error 25 in libsystemd-shared-247.so[7f7082dec000+18d000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.672822] Code:
Unable to access opcode bytes at RIP 0x7f7082e8c226.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.720393] fuse:
init (API version 7.32)
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.720569] xfs
filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.746690]
loadkeys[239]: segfault at 7fff01beebf8 ip 00007f8eb43b23d6 sp
00007fff01beebf8 error 25 in libc-2.31.so[7f8eb42e9000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.746695] Code:
Unable to access opcode bytes at RIP 0x7f8eb43b23ac.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778082]
systemd-udevd[279]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp
00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778087] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778352]
systemd-udevd[283]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp
00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778355] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778825]
systemd-udevd[288]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp
00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.778828] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.782195]
systemd-udevd[290]: segfault at 7ffee57897f8 ip 00007f1cfcbf7d1e sp
00007ffee57897f8 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.782198] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbf7cf4.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.782729]
systemd-udevd[271]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp
00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.782732] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.784499]
systemd-udevd[278]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp
00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.784504] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.785340]
systemd-udevd[285]: segfault at 7f1cfccccb30 ip 00007f1cfcbc3a12 sp
00007ffee5789720 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.785344] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbc39e8.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.802642] input:
Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input4
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.837995] ACPI:
Power Button [PWRF]
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.856081] input:
PC Speaker as /devices/platform/pcspkr/input/input5
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.858827] sr
0:0:0:0: Attached scsi generic sg0 type 5
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.873806] pstore:
Using crash dump compression: deflate
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.873810] pstore:
Registered efi as persistent store backend
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.874814]
iTCO_vendor_support: vendor-support=0
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.876912]
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.876965]
iTCO_wdt: Found a ICH9 TCO device (Version=2, TCOBASE=0x0660)
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.878544]
iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.880777] cryptd:
max_cpu_qlen set to 1000
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.898590] AVX2
version of gcm_enc/dec engaged.
May 3 23:19:45 gitlab-runner-portalprod kernel: [ 1.898591] AES CTR
mode by8 optimization enabled
May 3 23:19:46 gitlab-runner-portalprod kernel: [ 2.113861] pcieport
0000:00:01.6: pciehp: Slot(0-6): No device found
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.018119]
show_signal_msg: 14 callbacks suppressed
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.018123]
systemd-udevd[274]: segfault at 7ffee57897f8 ip 00007f1cfcbf7d1e sp
00007ffee57897f8 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.018134] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbf7cf4.
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.019739]
systemd-udevd[296]: segfault at 7ffee5789848 ip 00007f1cfcbf7d1e sp
00007ffee5789848 error 25 in libc-2.31.so[7f1cfcb1e000+159000]
May 3 23:19:49 gitlab-runner-portalprod kernel: [ 6.019764] Code:
Unable to access opcode bytes at RIP 0x7f1cfcbf7cf4.
May 3 23:20:22 gitlab-runner-portalprod kernel: [ 38.854995]
systemctl[527]: segfault at 7ffd14e78928 ip 00007f9edaedaa13 sp
00007ffd14e788f0 error 25 in libc-2.31.so[7f9edae0d000+159000]
May 3 23:20:22 gitlab-runner-portalprod kernel: [ 38.855007] Code:
Unable to access opcode bytes at RIP 0x7f9edaeda9e9.
May 3 23:20:22 gitlab-runner-portalprod kernel: [ 38.855253]
dbus-daemon[433]: segfault at 7ffc1aa74a28 ip 00007f1fec03ed1e sp
00007ffc1aa74a28 error 25 in libc-2.31.so[7f1febf65000+159000]
May 3 23:20:22 gitlab-runner-portalprod kernel: [ 38.855261] Code:
Unable to access opcode bytes at RIP 0x7f1fec03ecf4.
May 3 23:20:29 gitlab-runner-portalprod kernel: [ 44.834159]
pager[544]: segfault at 7ffc88996ea8 ip 00007f65c20003d6 sp
00007ffc88996ea8 error 25 in libc-2.31.so[7f65c1f37000+159000]
May 3 23:20:29 gitlab-runner-portalprod kernel: [ 44.834170] Code:
Unable to access opcode bytes at RIP 0x7f65c20003ac.
May 3 23:21:12 gitlab-runner-portalprod kernel: [ 88.221613]
run-parts[554]: segfault at 7ffe50526628 ip 00007fb22417370e sp
00007ffe50526628 error 25 in libc-2.31.so[7fb2240ce000+159000]
May 3 23:21:12 gitlab-runner-portalprod kernel: [ 88.221622] Code:
Unable to access opcode bytes at RIP 0x7fb2241736e4.
May 11 13:16:03 portal-prod kernel: nft[256]: segfault at 7fc86b408c38
ip 00007fc86b4321ef sp 00007ffcc1dc1e08 error 27 in ld-2.31.so[7fc86b4130>
During this particular boot the first problem was the "nft" segfault,
then several others appeared and the machine was unusable. I saw a kernel
panic once, but cannot reproduce that behaviour now.
maj 11 13:16:03 portal-prod systemd[1]: Starting Load Kernel Module fuse...
maj 11 13:16:03 portal-prod systemd[1]: Starting nftables...
maj 11 13:16:03 portal-prod systemd[1]: Condition check resulted in Set
Up Additional Binary Formats being skipped.
maj 11 13:16:03 portal-prod systemd[1]: Condition check resulted in File
System Check on Root Device being skipped.
maj 11 13:16:03 portal-prod kernel: nft[256]: segfault at 7fc86b408c38
ip 00007fc86b4321ef sp 00007ffcc1dc1e08 error 27 in ld-2.31.so[7fc86b4130>
maj 11 13:16:03 portal-prod systemd[1]: Starting Journal Service...
maj 11 13:16:03 portal-prod kernel: Code: Unable to access opcode bytes
at RIP 0x7fc86b4321c5.
maj 11 13:16:03 portal-prod kernel: fuse: init (API version 7.32)
maj 11 13:16:03 portal-prod systemd[1]: Starting Load Kernel Modules...
maj 11 13:16:03 portal-prod systemd[1]: Starting Remount Root and Kernel
File Systems...
maj 11 13:16:03 portal-prod systemd[1]: Starting Coldplug All udev
Devices...
maj 11 13:16:03 portal-prod systemd[1]: Mounted Huge Pages File System.
maj 11 13:16:03 portal-prod systemd[1]: Mounted POSIX Message Queue File
System.
maj 11 13:16:03 portal-prod kernel: EXT4-fs (vda3): re-mounted. Opts:
errors=remount-ro
maj 11 13:16:03 portal-prod systemd[1]: Mounted Kernel Debug File System.
maj 11 13:16:03 portal-prod systemd[1]: Mounted Kernel Trace File System.
maj 11 13:16:03 portal-prod systemd[1]: Finished Create list of static
device nodes for the current kernel.
maj 11 13:16:03 portal-prod systemd[1]: modprobe@configfs.service:
Succeeded.
maj 11 13:16:03 portal-prod systemd[1]: Finished Load Kernel Module
configfs.
maj 11 13:16:03 portal-prod systemd[1]: modprobe@drm.service: Succeeded.
maj 11 13:16:03 portal-prod systemd[1]: Finished Load Kernel Module drm.
maj 11 13:16:03 portal-prod systemd[1]: modprobe@fuse.service: Succeeded.
maj 11 13:16:03 portal-prod systemd[1]: Finished Load Kernel Module fuse.
maj 11 13:16:03 portal-prod systemd[1]: nftables.service: Main process
exited, code=killed, status=11/SEGV
maj 11 13:16:03 portal-prod systemd[1]: nftables.service: Failed with
result 'signal'.
maj 11 13:16:03 portal-prod systemd[1]: Failed to start nftables.
maj 11 13:16:03 portal-prod systemd[1]: Finished Load Kernel Modules.
maj 11 13:16:03 portal-prod systemd[1]: Finished Remount Root and Kernel
File Systems.
maj 11 13:16:03 portal-prod systemd[1]: Reached target Network (Pre).
maj 11 13:16:03 portal-prod systemd[1]: Mounting FUSE Control File System...
maj 11 13:16:03 portal-prod systemd[1]: Mounting Kernel Configuration
File System...
maj 11 13:16:03 portal-prod systemd[1]: Condition check resulted in
Rebuild Hardware Database being skipped.
maj 11 13:16:03 portal-prod systemd[1]: Starting Apply Kernel Variables...
maj 11 13:16:03 portal-prod systemd[1]: Starting Create System Users...
maj 11 13:16:03 portal-prod systemd[1]: Mounted Kernel Configuration
File System.
maj 11 13:16:03 portal-prod systemd[1]: Mounted FUSE Control File System.
maj 11 13:16:03 portal-prod systemd[1]: systemd-sysctl.service: Main
process exited, code=killed, status=11/SEGV
maj 11 13:16:03 portal-prod systemd[1]: systemd-sysctl.service: Failed
with result 'signal'.
maj 11 13:16:03 portal-prod systemd[1]: Failed to start Apply Kernel
Variables.
Another boot got as far as the login prompt, with several segfaults and
no network, but I was thrown back to the login screen after a moment.
[ 2.095923] systemd[1]: Finished Load Kernel Modules.
[ 2.097033] modprobe[250]: segfault at 7fffcb38df60 ip
00007faa724aabda sp 00007fffcb38df50 error 27 in
ld-2.31.so[7faa724a1000+20000]
[ 2.098711] Code: Unable to access opcode bytes at RIP 0x7faa724aabb0.
[ 2.099747] systemd[1]: Mounting Kernel Configuration File System...
[ 2.101174] systemd[1]: Starting Apply Kernel Variables...
[ 2.101995] systemd[1]: modprobe@fuse.service: Succeeded.
[ 2.102743] systemd[1]: Finished Load Kernel Module fuse.
[ 2.103471] systemd[1]: Condition check resulted in FUSE Control File
System being skipped.
[ 2.105732] systemd[1]: Mounted Kernel Configuration File System.
[ 2.107672] EXT4-fs (vda3): re-mounted. Opts: errors=remount-ro
[ 2.109296] systemd[1]: Finished Remount Root and Kernel File Systems.
[ 2.110564] systemd[1]: Condition check resulted in Rebuild Hardware
Database being skipped.
[ 2.112033] systemd[1]: Starting Create System Users...
[ 2.117134] systemd[1]: Finished Apply Kernel Variables.
[ 2.117844] systemd[1]: systemd-sysusers.service: Main process
exited, code=killed, status=11/SEGV
[ 2.118923] systemd[1]: systemd-sysusers.service: Failed with result
'signal'.
[ 2.119857] systemd[1]: Failed to start Create System Users.
[ 2.120900] systemd[1]: Starting Create Static Device Nodes in /dev...
[ 2.121841] systemd[1]: modprobe@drm.service: Succeeded.
[ 2.122546] systemd[1]: Finished Load Kernel Module drm.
[ 2.137334] systemd[1]: Started Journal Service.
[ 2.170459] input: Power Button as
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input4
[ 2.185042] ACPI: Power Button [PWRF]
[ 2.186245] systemctl[302]: segfault at 7f60950bda66 ip
00007f60951e4bee sp 00007ffd6fc41020 error 25 in
libc-2.31.so[7f60950d2000+159000]
[ 2.188023] Code: Unable to access opcode bytes at RIP 0x7f60951e4bc4.
[ 2.202115] input: PC Speaker as /devices/platform/pcspkr/input/input5
[ 2.203174] sr 0:0:0:0: Attached scsi generic sg0 type 5
[ 2.205279] pstore: Using crash dump compression: deflate
[ 2.205665] pstore: Registered efi as persistent store backend
[ 2.210254] modprobe[322]: segfault at 7ffc6a6d5fc8 ip
00007fabda493093 sp 00007ffc6a6d5fd0 error 27 in
ld-2.31.so[7fabda493000+20000]
[ 2.211566] Code: Unable to access opcode bytes at RIP 0x7fabda493069.
[ 2.217067] Adding 1952764k swap on /dev/vda2. Priority:-2 extents:1
across:1952764k FS
[ 2.221668] systemd-sysuser[331]: segfault at 7f4a6768e090 ip
00007f4a683291ef sp 00007fff92a6dc58 error 27 in
ld-2.31.so[7f4a6830a000+20000]
[ 2.222233] setfont[332]: segfault at 7ffe6ed25fe8 ip
00007f3807b173d6 sp 00007ffe6ed25fe8 error 25 in
libc-2.31.so[7f3807a4e000+159000]
[ 2.222686] Code: Unable to access opcode bytes at RIP 0x7f4a683291c5.
[ 2.223633] Code: Unable to access opcode bytes at RIP 0x7f3807b173ac.
[ 2.225057] iTCO_vendor_support: vendor-support=0
[ 2.225520] cryptd: max_cpu_qlen set to 1000
[ 2.226497] fuse: init (API version 7.32)
[ 2.228231] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[ 2.228728] iTCO_wdt: Found a ICH9 TCO device (Version=2, TCOBASE=0x0660)
[ 2.229398] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[ 2.232657] AVX2 version of gcm_enc/dec engaged.
[ 2.233042] AES CTR mode by8 optimization enabled
[ 2.287356] EXT4-fs (vdc1): mounted filesystem with ordered data
mode. Opts: (null)
[ 2.288381] EXT4-fs (vdb1): mounted filesystem with ordered data
mode. Opts: (null)
[ 2.293006] systemd-journald[255]: Received client request to flush
runtime journal.
[ 2.301747] FAT-fs (vda1): Volume was not properly unmounted. Some
data may be corrupt. Please run fsck.
[ 2.304947] systemd-journal[255]: segfault at 7f1513dd9e00 ip
00007f15157e8683 sp 00007fff6feda320 error 25 in
libsystemd-shared-247.so[7f151567a000+18d000]
[ 2.306036] Code: Unable to access opcode bytes at RIP 0x7f15157e8659.
[ 2.307205] systemd[1]: systemd-journal-flush.service: Main process
exited, code=exited, status=1/FAILURE
[ 2.308002] systemd[1]: systemd-journal-flush.service: Failed with
result 'exit-code'.
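
The `error 25` and `error 27` fields in the segfault lines above are x86 page-fault error codes. The following is a minimal sketch for decoding them, assuming the kernel prints the value in hexadecimal and that the bits follow the kernel's `X86_PF_*` flags; the function name `decode_pf_error` is made up for this illustration.

```python
# x86 page-fault error code bits, named after the kernel's X86_PF_* flags.
# Assumption: the "error NN" field in the segfault line is hexadecimal.
PF_BITS = [
    (0x01, "PROT"),   # fault on a present page (protection violation)
    (0x02, "WRITE"),  # access was a write
    (0x04, "USER"),   # access came from user mode
    (0x08, "RSVD"),   # reserved bit set in a page-table entry
    (0x10, "INSTR"),  # fault on an instruction fetch
    (0x20, "PK"),     # blocked by protection keys
]


def decode_pf_error(code_hex):
    """Return the names of the flags set in a hexadecimal error code string."""
    code = int(code_hex, 16)
    return [name for bit, name in PF_BITS if code & bit]


if __name__ == "__main__":
    for code in ("25", "27"):
        print(code, decode_pf_error(code))
```

Under those assumptions, 25 decodes to PROT, USER and PK, and 27 additionally sets WRITE. This only makes the codes readable; it does not by itself identify the cause.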
Kind regards,
--
Kamil Wilczek [https://keys.openpgp.org/]
[6C4BE20A90A1DBFB3CBE2947A832BF5A491F9F2A]
--- End Message ---