Package: prometheus-node-exporter
Severity: normal
X-Debbugs-Cc: chrisk...@gmail.com

This is an odd one that is very hard to trigger. I am acutely aware of what
makes a good bug report and fully understand that this is not one of them, so
please let me know what I can do to help create more clarity here.

I've observed that several systems running prometheus-node-exporter's systemd
collector eventually enter a failure mode in which systemd calls never come
back. I encountered it primarily because I use pacemaker's systemd OCF type,
which also communicates with systemd over a dbus socket. After 60-90 days of
uptime, dbus messages to systemd would begin to time out with an error like:

prometheus-node-exporter[4315]: time="2022-11-28T12:25:30-05:00" level=error 
msg="ERROR: systemd collector failed after 0.082480s: couldn't get units: 
Failed to activate service 'org.freedesktop.systemd1': timed out 
(service_start_timeout=25000ms)" source="collector.go:132"

This would continue until systemd was incidentally restarted by some other
operation, or until the node rebooted. In some cases, systemd could not be
recovered and the node had to be reset.

Unfortunately I have never been able to find firm evidence of the underlying
defect, and was only turned on to the possibility of the issue through 
conversations with systemd contributors at conferences who firmly advised me 
NOT to use the dbus-systemd connector to talk to the daemon, as they were aware 
of some irregular buggy behavior that could cause socket failure. 

I was able to reasonably prove to myself that the collector was at fault in
some measure by modifying it to run at a high frequency and observing that the
aforementioned failure triggered more rapidly (after 5-10 days instead of
60-90 days). The more communication there is over that socket, the sooner the
issue seems to reproduce.
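
For anyone who wants to attempt a reproduction without patching the collector,
a stress loop along the following lines might serve as a starting point. This
is a minimal sketch and not the collector modification described above: it
assumes that calling ListUnits on org.freedesktop.systemd1 through busctl
exercises the same dbus path as the collector, which is unverified. The helper
names probe() and hammer() and the 25-second busctl timeout are illustrative
choices only.

```python
import subprocess
import time

# Assumption: this busctl call approximates the collector's "get units"
# request. It is not the collector's actual code path.
LIST_UNITS_CMD = [
    "busctl", "--system", "--timeout=25", "call",
    "org.freedesktop.systemd1",
    "/org/freedesktop/systemd1",
    "org.freedesktop.systemd1.Manager",
    "ListUnits",
]


def probe(cmd, timeout_s=30.0):
    """Run one call and return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, time.monotonic() - start, "no reply within timeout"
    except OSError as exc:
        # For example, the command is not installed on this machine.
        return False, time.monotonic() - start, str(exc)
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return False, elapsed, result.stderr.strip()
    return True, elapsed, ""


def hammer(cmd, interval_s=1.0, max_calls=None):
    """Call cmd repeatedly.

    Returns the 1-based index of the first failing call, or None if
    max_calls calls all succeeded. With max_calls=None it runs until a
    call fails.
    """
    count = 0
    while max_calls is None or count < max_calls:
        count += 1
        ok, elapsed, detail = probe(cmd)
        if not ok:
            print(f"call {count} failed after {elapsed:.3f}s: {detail}")
            return count
        time.sleep(interval_s)
    return None
```

Running hammer(LIST_UNITS_CMD) on an affected host would loop until the first
failed or timed-out call and print how many calls it took. Lowering interval_s
increases traffic on the socket, which is the variable that appeared to matter
in the observations above.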

I'd recommend that Debian revert its patch that enables the systemd collector
by default, unless it can be demonstrated that this bug does not recur in
bookworm+. I have reproduced it on Buster and Bullseye.


-- System Information:
Debian Release: 12.5
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.1.0-21-amd64 (SMP w/128 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, 
TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages prometheus-node-exporter depends on:
ii  adduser              3.134
ii  init-system-helpers  1.65.2
ii  libc6                2.36-9+deb12u7

Versions of packages prometheus-node-exporter recommends:
ii  dbus                                 1.14.10-1~deb12u1
pn  prometheus-node-exporter-collectors  <none>

prometheus-node-exporter suggests no packages.
