Bug#793814: I just ran into this on debian testing
Package: systemd Version: 225-1 My fatal symptoms are that Postfix SASL stops working with Sep 17 07:11:11 bedivere postfix/smtpd[20955]: warning: unknown[x.x.x.x]: SASL PLAIN authentication failed: Connection lost to authentication server Which is really annoying because it requires fixing by restarting systemd-logind on the server and then restarting evolution on the client. I also get the delayed login problem until systemd-logind is restarted, but that's minor compared to the postfix failure, which triggers complaints from my users. James ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#800087: systemd lists running daemons as failed after reboot
On Wed, 2015-09-30 at 15:20 +0200, Michael Biebl wrote: > Control: tags -1 + moreinfo > > Am 26.09.2015 um 18:08 schrieb James Bottomley: > > Package: systemd > > Version: 226-3 > > Severity: normal > > > > rebooting the current system often causes listed init system failures. > > in addition to failing to start spamd (listed under separate bug). > > Systemd thinks units have failed when, in fact, they're running. The > > most common units for this are apache2 and mysql. > > > > This is the state of a recent reboot, where apache2 is listed as failed > > > > bedivere:~# systemctl --failed > > UNITLOAD ACTIVE SUBDESCRIPTION > > ● apache2.service loaded failed failed LSB: Apache2 web server > > > > LOAD = Reflects whether the unit definition was properly loaded. > > ACTIVE = The high-level unit activation state, i.e. generalization of SUB. > > SUB= The low-level unit activation state, values depend on unit type. > > > > 1 loaded units listed. Pass --all to see loaded but inactive units, too. > > To show all installed unit files use 'systemctl list-unit-files'. > > > > Whereas ps shows it to be running > > What's the output of > systemctl status apache2.service ? OK, so the machine only has a maintenance window on saturday. I rebooted and asked this and this is the result: bedivere:~# systemctl status -l apache2.service ● apache2.service - LSB: Apache2 web server Loaded: loaded (/etc/init.d/apache2) Active: failed (Result: exit-code) since Sat 2015-10-03 10:45:42 PDT; 1min 53s ago Docs: man:systemd-sysv-generator(8) Process: 693 ExecStart=/etc/init.d/apache2 start (code=exited, status=1/FAILURE) CGroup: /system.slice/apache2.service ├─1407 /usr/sbin/apache2 -k start ├─1410 PassengerAgent watchdog ├─1417 PassengerAgent server ├─1422 PassengerAgent logger ├─1699 /usr/sbin/apache2 -k start ├─1701 /usr/sbin/apache2 -k start ├─1702 /usr/sbin/apache2 -k start ├─1704 /usr/sbin/apache2 -k start └─1706 /usr/sbin/apache2 -k start Oct 03 10:44:38 bedivere systemd[1]: Starting LSB: Apache2 web server... Oct 03 10:45:06 bedivere apache2[693]: Starting web server: apache2AH00180: WARNING: MaxRequestWorkers of 20 exceeds ServerLimit value of Oct 03 10:45:06 bedivere apache2[693]: 5 servers, decreasing MaxRequestWorkers to 5. Oct 03 10:45:06 bedivere apache2[693]: To increase, please see the ServerLimit directive. Oct 03 10:45:42 bedivere apache2[693]: failed! Oct 03 10:45:42 bedivere apache2[693]: The apache2 instance did not start within 20 seconds. Please read the log files to discover problems ... (warning). Oct 03 10:45:42 bedivere systemd[1]: apache2.service: Control process exited, code=exited status=1 Oct 03 10:45:42 bedivere systemd[1]: Failed to start LSB: Apache2 web server. Oct 03 10:45:42 bedivere systemd[1]: apache2.service: Unit entered failed state. Oct 03 10:45:42 bedivere systemd[1]: apache2.service: Failed with result 'exit-code'. James ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#800087: closed by Michael Biebl (Re: Bug#800087: systemd lists running daemons as failed after reboot)
On Sun, 2015-10-04 at 23:06 +, Debian Bug Tracking System wrote: > Well, this is apparently an apache configuration problem, causing the > service to return a non-zero exit code. > > Once you fix that, the service should not be marked as failed. That's pretty obviously an incorrect deduction, isn't it? If it were a config file problem leading to the control process erroring out then the manual systemctl start apache would also fail; but, as you can see from the description in the original report, it doesn't. The logs have this in them: [ 2015-10-03 10:41:55.6129 1309/b5cfeb40 age/Hel/Main.cpp:455 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown) [ 2015-10-03 10:41:55.6160 1319/b6960b40 age/Log/Main.cpp:349 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown) [ 2015-10-03 10:45:21.1489 1353/b5dfeb40 age/Hel/Main.cpp:455 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown) [ 2015-10-03 10:45:21.1521 1374/b68fab40 age/Log/Main.cpp:349 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown) So apache dies because something external kills it during boot up. Hm, I wonder what that could be ... James ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#793814: Workaround for this bug
Since there doesn't seem to have been any progress on this from the systemd side, and I'm really tired of my email users yelling at me (justifiably since once the systemd-logind service dies, dovecot sasl fails to work and they can't send email), this is the script I came up with to alleviate the problem: --- #!/bin/bash ## # shell script to restart systemd.logind when it fails ## tail -F /var/log/auth.log|while read junk; do case $junk in *pam_systemd*org.freedesktop.login1*timed\ out*) echo $junk systemctl restart systemd-logind.service ;; esac done --- You spawn it from init (redirected to a log) and it will monitor the authentication log for any indication that systemd-logind has dropped off the dbus and respawn it. The effect is that even if they get a SASL authentication failure one time, it will already be working the next time they try. James ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#793814: Workaround for this bug
On Thu, 2015-10-15 at 00:30 +0200, Michael Biebl wrote: > Am 15.10.2015 um 00:16 schrieb James Bottomley: > > Since there doesn't seem to have been any progress on this from the > > systemd side, and I'm really tired of my email users yelling at me > > (justifiably since once the systemd-logind service dies, dovecot sasl > > fails to work and they can't send email), this is the script I came up > > with to alleviate the problem: > > So are you restarting dbus as well for some reason? You can't really restart dbus in a systemd system... However, the problem seems to be the endpoint stops listening. systemd-logind is still running but strace -p shows it no longer responds when something contacts the end point. If you restart it, it will listen for a while before stopping again. The script simply detects the dbus timeout and restarts the service. James signature.asc Description: This is a digitally signed message part ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#793814: Workaround for this bug
On Thu, 2015-10-15 at 01:06 +0200, Michael Biebl wrote: > Am 15.10.2015 um 00:34 schrieb James Bottomley: > > On Thu, 2015-10-15 at 00:30 +0200, Michael Biebl wrote: > >> Am 15.10.2015 um 00:16 schrieb James Bottomley: > >>> Since there doesn't seem to have been any progress on this from the > >>> systemd side, and I'm really tired of my email users yelling at me > >>> (justifiably since once the systemd-logind service dies, dovecot sasl > >>> fails to work and they can't send email), this is the script I came up > >>> with to alleviate the problem: > >> > >> So are you restarting dbus as well for some reason? > > > > You can't really restart dbus in a systemd system... However, the > > problem seems to be the endpoint stops listening. systemd-logind is > > still running but strace -p shows it no longer responds when something > > contacts the end point. If you restart it, it will listen for a while > > before stopping again. > > If in your case you are not restarting dbus, can you isolate what's > causing logind to stop responding on your system? > I.e., can you provide us with steps how to reproduce the issue. It happens naturally in a running system. This one supplies sparkleshare sync services for shared and backup directories, so there are a huge number of micro sshd logins doing syncs, SASL via dovecot for postfix and dovecot itself. It's not based on LDAP, so everything goes via the password file based authenticator. The logs indicate that from reboot to first timeout took about 20 hours. Thereafter, the timeouts seem to increase in frequency to about two or three an hour. James signature.asc Description: This is a digitally signed message part ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#793814: Workaround for this bug
On Thu, 2015-10-15 at 07:35 +0200, Stefano wrote: > Same here. I have a git server with clients connecting every 10 minutes or > so via SSH, and the problem is exactly the same. > > About 20 hours from reboot logind goes nuts, then it must be restarted at > very variable intervals, about a couple a day. What architecture is this? I only see the problems on a 32 bit x86 system. I don't see them on a 64 bit x86 in spite of trying to trigger them there (in a sample size of 1 of each, so it's not statistically significant ... yet). I also see that /run/systemd/sessions has about a hundred or so sessions whose leaders are dead on the 32 bit system. On the 64 bit system it appears only still living sessions are in here. The final curious observation is that on the 32 bit system, the session id never seems to go over 4k. On the 64 bit system, it does. James ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#793814: Workaround for this bug
On Thu, 2015-10-15 at 19:45 +0200, Michael Biebl wrote: > Btw, in both cases (#770135 and #793814) the users have been restarting > the dbus daemon. You mentioned that you did not do that. > > So I'm not sure if it's actually the same issue after all and your > problem should be tracked as a separate issue. I'm starting to think it's a logind robustness problem. I traced back through the logs to the first instance after reboot and this is what I find: Oct 11 04:17:01 bedivere CRON[2185]: pam_unix(cron:session): session opened for user root by (uid=0) Oct 11 04:17:01 bedivere CRON[2185]: pam_unix(cron:session): session closed for user root Oct 11 04:17:40 bedivere sshd[2193]: Accepted publickey for clh15 from 184.11.141.41 port 38172 ssh2: DSA SHA256:UK5h9LRMtBthvW0Ncv1SG4WRmSFNs1hPcowPzzyt+iY Oct 11 04:17:40 bedivere sshd[2193]: pam_unix(sshd:session): session opened for user clh15 by (uid=0) Oct 11 04:17:40 bedivere systemd-logind[640]: New session 916 of user clh15. Oct 11 04:17:40 bedivere sshd[2195]: Accepted publickey for clh15 from 184.11.141.41 port 38173 ssh2: DSA SHA256:UK5h9LRMtBthvW0Ncv1SG4WRmSFNs1hPcowPzzyt+iY Oct 11 04:17:40 bedivere sshd[2195]: pam_unix(sshd:session): session opened for user clh15 by (uid=0) Oct 11 04:17:40 bedivere systemd-logind[640]: New session 917 of user clh15. Oct 11 04:17:40 bedivere sshd[2195]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying Oct 11 04:17:40 bedivere systemd-logind[640]: Failed to abandon session scope: Transport endpoint is not connected Oct 11 04:17:40 bedivere sshd[2204]: Received disconnect from 184.11.141.41: 11: disconnected by user Oct 11 04:17:40 bedivere sshd[2204]: Disconnected from 184.11.141.41 Oct 11 04:17:40 bedivere sshd[2193]: pam_unix(sshd:session): session closed for user clh15 Oct 11 04:17:40 bedivere sshd[2203]: Received disconnect from 184.11.141.41: 11: disconnected by user Oct 11 04:17:40 bedivere sshd[2203]: Disconnected from 184.11.141.41 Oct 11 04:17:40 bedivere sshd[2195]: pam_unix(sshd:session): session closed for user clh15 Oct 11 04:18:05 bedivere sshd[2193]: pam_systemd(sshd:session): Failed to release session: Connection timed out Oct 11 04:18:05 bedivere systemd-logind[640]: Failed to abandon session scope: Transport endpoint is not connected Oct 11 04:18:05 bedivere dbus[698]: [system] Failed to activate service 'org.freedesktop.login1': timed out Thereafter everything fails to activate org.freedesktop.login1. However, it looks like the trouble is here: Oct 11 04:17:40 bedivere sshd[2195]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying Logind is still active and replies later in the trace, so it looks like dbus either dropped a message or did some type of unexpected disconnect. After this, logind works until it can't abandon the session, then it never replies on the bus again. So I suspect somewhere in the error handling inside logind it doesn't cope with unexpected loss of dbus messages. James signature.asc Description: This is a digitally signed message part ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#775578: systemd kills spamassassin on system start
On Thu, 2016-02-11 at 23:41 +0100, Michael Biebl wrote: > Control: tags -1 + moreinfo > > On Sat, 17 Jan 2015 09:08:32 -0800 James Bottomley > wrote: > > Package: systemd > > Version: 215-8 > > Severity: normal > > > > Almost every time the system reboots, spamassassin fails to start. > > The systemd logs for this are: > > > > # systemctl status -l spamassassin.service > > â— spamassassin.service - Perl-based spam filter using text > > analysis > >Loaded: loaded (/lib/systemd/system/spamassassin.service; > > enabled) > >Active: failed (Result: timeout) since Sat 2015-01-17 08:49:04 > > PST; 3min 45s ago > > Process: 528 ExecStart=/usr/sbin/spamd -d - > > -pidfile=/var/run/spamassassin.pid $OPTIONS (code=killed, > > signal=TERM) > > > > Jan 17 08:48:10 bedivere spamd[528]: logger: removing stderr method > > Jan 17 08:49:04 bedivere systemd[1]: spamassassin.service start > > operation timed out. Terminating. > > Jan 17 08:49:04 bedivere systemd[1]: Failed to start Perl-based > > spam filter using text analysis. > > Jan 17 08:49:04 bedivere systemd[1]: Unit spamassassin.service > > entered failed state. > > Jan 17 08:49:04 bedivere spamd[748]: spamd: server killed by > > SIGTERM, shutting down > > Jan 17 08:49:04 bedivere spamd[748]: spamd: cannot unlink > > /var/run/spamassassin.pid: No such file or directory > > > > This server is still x86 and a big internet system, so it has lots > > of > > intensive processes on start, like fail2ban , clamd and apache. It > > looks like because of this, systemd gives spamassassin a few > > seconds > > (there's no log of how long; the logger message is from the pre > > -reboot > > os) to start and it takes longer. > > > > As far as I can tell, this value doesn't seem to be configurable or > > even package specific. It looks remarkably silly for the init > > system > > to impose an absolute timeout on service start, particularly when > > it > > doesn't take into account the characteristics of the machine or ask > > the package how long it might reasonably take. > > > > So far, it's only spamassassin, so it's annoying but not serious to > > have to log in and restart it after every reboot. However, if > > systemd > > did this to a necessary service, it would become a serious bug > > Can you provide instructions how this issue can be reproduced? You mean reproduce this outside of booting the system? I have no idea how to do that. I suspect it's a load issue, so this is the system: cat /prbedivere:~# cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Pentium(R) 4 CPU 2.80GHz stepping: 9 microcode : 0x2e cpu MHz : 2813.471 cache size : 512 KB physical id : 0 siblings: 1 core id : 0 cpu cores : 1 apicid : 0 initial apicid : 0 fdiv_bug: no f00f_bug: no coma_bug: no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr bugs: bogomips: 5626.94 clflush size: 64 cache_alignment : 128 address sizes : 36 bits physical, 32 bits virtual power management: bedivere:~# cat /proc/meminfo MemTotal:1022016 kB MemFree: 43816 kB MemAvailable: 346924 kB Buffers: 36980 kB Cached: 295580 kB SwapCached:39344 kB Active: 528756 kB Inactive: 395168 kB Active(anon): 321488 kB Inactive(anon): 306540 kB Active(file): 207268 kB Inactive(file):88628 kB Unevictable: 0 kB Mlocked: 0 kB HighTotal:131016 kB HighFree: 7628 kB LowTotal: 891000 kB LowFree: 36188 kB SwapTotal: 1951892 kB SwapFree:1582008 kB Dirty: 120 kB Writeback: 0 kB AnonPages:568992 kB Mapped:54324 kB Shmem: 36664 kB Slab: 36280 kB SReclaimable: 21972 kB SUnreclaim:14308 kB KernelStack:3096 kB PageTables: 4420 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit: 2462900 kB Committed_AS:1973536 kB VmallocTotal: 122880 kB VmallocUsed: 11120 kB VmallocChunk: 103032 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 4096 kB DirectMap4k:
Bug#793814: Workaround for this bug
On Wed, 2016-05-11 at 12:06 +0200, Michael Biebl wrote: > Hi James > > On Thu, 15 Oct 2015 10:55:05 -0700 James Bottomley > wrote: > > On Thu, 2015-10-15 at 19:45 +0200, Michael Biebl wrote: > > > Btw, in both cases (#770135 and #793814) the users have been > > > restarting > > > the dbus daemon. You mentioned that you did not do that. > > > > > > So I'm not sure if it's actually the same issue after all and > > > your > > > problem should be tracked as a separate issue. > > > > I'm starting to think it's a logind robustness problem. I traced > > back > > through the logs to the first instance after reboot and this is > > what I > > find: > > .. > > Logind is still active and replies later in the trace, so it looks > like > > dbus either dropped a message or did some type of unexpected > > disconnect. > > After this, logind works until it can't abandon the session, then > > it > > never replies on the bus again. So I suspect somewhere in the > > error > > handling inside logind it doesn't cope with unexpected loss of dbus > > messages. > > We recently had this upstream bug report, which looks like it could > be > relevant: > https://github.com/systemd/systemd/issues/1961 > > Do you see session scopes piling up on your system? bedivere:~# ls -ld /run/systemd/system/session-*|wc -l 141 So apparently, yes. signature.asc Description: This is a digitally signed message part ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers