Package: prometheus-node-exporter
Severity: normal
X-Debbugs-Cc: chrisk...@gmail.com

This is an odd one that is very hard to trigger. I am acutely aware of what makes a good bug report and fully understand that this is not one of them, so please let me know what I can do to help create more clarity here.

I've observed on several systems that run prometheus-node-exporter's systemd collector that there is an eventual failure mode in which systemd calls never come back. I encountered it primarily due to using pacemaker's systemd OCF type, which also communicates with systemd over a dbus socket.

After a period of 60-90 days of uptime, dbus messages to systemd would begin to time out with an error like:

  prometheus-node-exporter[4315]: time="2022-11-28T12:25:30-05:00" level=error msg="ERROR: systemd collector failed after 0.082480s: couldn't get units: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)" source="collector.go:132"

This would continue until systemd was incidentally restarted by some other operation, or until the node rebooted. In some cases, systemd could not be recovered and a reset would need to occur.

Unfortunately I have never been able to find firm evidence of the underlying defect, and was only turned on to the possibility of the issue through conversations with systemd contributors at conferences, who firmly advised me NOT to use the dbus-systemd connector to talk to the daemon, as they were aware of some irregular buggy behavior that could cause socket failure.

I was able to reasonably prove to myself that the collector was at fault in some measure by modifying it to run at a high frequency and observing that the aforementioned failure triggered more rapidly (from 60-90 days to 5-10 days). The issue seems to increase in likelihood (decrease in reproduction time) the more communication there is over that socket.
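
For anyone wanting to approximate the high-frequency experiment without patching the exporter, the following is a rough sketch only and is not the modification described above. It assumes that busctl (shipped with systemd) is installed and that polling ListUnits on org.freedesktop.systemd1 exercises the same D-Bus path as the collector, which has not been verified here. The names list_units, stress, and PROBE are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical stress loop: repeatedly ask systemd for its unit list over the
# system D-Bus and stop at the first call that fails or times out.

# One probe: a ListUnits method call with a 25 second timeout.
# Assumption: this approximates what the systemd collector does per scrape.
list_units() {
    busctl --system --timeout=25 call \
        org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager ListUnits >/dev/null
}

# stress COUNT DELAY
#   COUNT = number of probes to send, DELAY = seconds to sleep between probes.
#   Set PROBE to another command name to substitute the probe (e.g. for a dry run).
stress() {
    i=0
    while [ "$i" -lt "$1" ]; do
        if ! "${PROBE:-list_units}"; then
            echo "call $i failed at $(date '+%Y-%m-%dT%H:%M:%S%z')"
            return 1
        fi
        i=$((i + 1))
        sleep "$2"
    done
    echo "completed $1 calls without failure"
}
```

After sourcing the file, something like "stress 100000 1" would send one call per second. Whether this triggers the failure any faster than the collector itself is untested.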

I'd recommend that Debian revert its patch that enables the systemd collector by default, unless it can be demonstrated that this bug does not recur in bookworm+. I have reproduced it on Buster and Bullseye.

-- System Information:
Debian Release: 12.5
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.1.0-21-amd64 (SMP w/128 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages prometheus-node-exporter depends on:
ii  adduser              3.134
ii  init-system-helpers  1.65.2
ii  libc6                2.36-9+deb12u7

Versions of packages prometheus-node-exporter recommends:
ii  dbus                                 1.14.10-1~deb12u1
pn  prometheus-node-exporter-collectors  <none>

prometheus-node-exporter suggests no packages.