I just want to follow up on this that I believe I have found the reason for this bug and I have a solution which I am testing out.
In the output of `systemctl status systemd-journald` it says: Active: failed (Result: start-limit) and this is related to the systemd mechanism to stop cycles where services keep restarting. It's related to the configs StartLimitIntervalSec= [defaults to DefaultStartLimitIntervalSec=10 sec by default] StartLimitBurst= [defaults to DefaultStartLimitBurst=5 by default] so that if a service is restarted 5 times within 10 seconds, it's not restarted any more to avoid wasting resources. What seems to happen is that if systemd-journal-flush.service sends the signal SIGUSR1 to systemd-journald and systemd-journald wasn't ready to handle it for some reason (e.g. had just been started, or there was some hangup), then systemd-journald is restarted, and systemd-journal-flush.service immediately restarts also. Again systemd-journald won't be ready to handle the signal as it hasn't had time to run the signal handler, and the cycle repeats 5 times until the systemd-journald service hits the StartLimitBurst and dies. I verified using `lastcomm` (we have process accounting enabled) that this did in fact happen. The way I am trying to fix it is to introduce a restart delay of 10 seconds into the systemd-journal-flush service. See the config file below; the only new line that differs from the defaults in /lib/systemd/system/systemd-journal-flush.service is the line RestartSec=10. Dan ====== cat /etc/systemd/system/systemd-journal-flush.service # This file is part of systemd. # # systemd is free software; you can redistribute it and/or modify it # under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation; either version 2.1 of the License, or # (at your option) any later version. [Unit] Description=Trigger Flushing of Journal to Persistent Storage Documentation=man:systemd-journald.service(8) man:journald.conf(5) DefaultDependencies=no Requires=systemd-journald.service After=systemd-journald.service local-fs.target remote-fs.target Before=systemd-user-sessions.service # Don't re-start this service too often, or we can get into a cycle where # if this service was run just after systemd-journald was started, it # fails to handle the signal, causing that service to be restarted # and this servie to be immediately restarted also, until we hit # the limit of StartLimitBurst=5 restarts and systemd-journald # dies. RestartSec=10 [Service] ExecStart=/bin/systemctl kill --kill-who=main --signal=SIGUSR1 systemd-journald.service Type=oneshot On Sat, Oct 15, 2016 at 9:48 PM, Daniel Povey <dpo...@gmail.com> wrote: > Another observation about this bug, which might be helpful. > > If the signal is sent to systemd-journald via > /bin/systemctl kill --kill-who=main --signal=SIGUSR1 > systemd-journald.service > then messages like the following show up in the kernel messages from > `dmesg -T`, like: > > [Sat Oct 15 21:02:35 2016] systemd-journald[26517]: Received request to > flush runtime journal from PID 1 > > but they don't show up in the output of 'journalctl -r'. > In /etc/systemd/journald.conf, it says: > > #MaxLevelStore=debug > #MaxLevelSyslog=debug > > so I would have thought the same messages would go to both places. > I don't know if I'm misunderstanding something here.. > > Dan > > > > > > On Sat, Oct 15, 2016 at 9:07 PM, Daniel Povey <dpo...@gmail.com> wrote: > >> BTW, I attach the output from `systemd-analyze dump`, as dump.txt. >> It would be great if the debian people could help us look into this. >> Lennart has a policy that his team will only look into bug reports in the >> latest two versions of systemd, and obviously we are well behind that. >> >> Dan >> >> >> On Sat, Oct 15, 2016 at 8:53 PM, Daniel Povey <dpo...@gmail.com> wrote: >> >>> >>> I just want to report that we are suffering from this bug, and it is >>> quite frequent. >>> This is with version 215-17+deb8u5 . >>> >>> root@a12:~# systemctl status systemd-journald >>> >>> *** systemd-journald.service - Journal Service >>> >>> Loaded: loaded (/lib/systemd/system/systemd-journald.service; static) >>> >>> Active: *failed* (Result: start-limit) since Sat 2016-10-15 12:01:57 >>> EDT; 8h ago >>> >>> Docs: man:systemd-journald.service(8) >>> >>> man:journald.conf(5) >>> >>> Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed, >>> signal=USR1)* >>> >>> Main PID: 51561 (code=killed, signal=USR1) >>> >>> >>> >> >
_______________________________________________ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers