Bug#793814: I just ran into this on debian testing

2015-09-17 Thread James Bottomley
Package: systemd
Version: 225-1

My fatal symptoms are that Postfix SASL stops working with

Sep 17 07:11:11 bedivere postfix/smtpd[20955]: warning: unknown[x.x.x.x]: SASL PLAIN authentication failed: Connection lost to authentication server

This is really annoying, because fixing it requires restarting
systemd-logind on the server and then restarting Evolution on the
client.

I also get the delayed login problem until systemd-logind is restarted,
but that's minor compared to the postfix failure, which triggers
complaints from my users.

James

___
Pkg-systemd-maintainers mailing list
Pkg-systemd-maintainers@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers


Bug#800087: systemd lists running daemons as failed after reboot

2015-10-03 Thread James Bottomley
On Wed, 2015-09-30 at 15:20 +0200, Michael Biebl wrote:
> Control: tags -1 + moreinfo
> 
> Am 26.09.2015 um 18:08 schrieb James Bottomley:
> > Package: systemd
> > Version: 226-3
> > Severity: normal
> > 
> > Rebooting the current system often causes listed init-system failures,
> > in addition to failing to start spamd (filed under a separate bug).
> > Systemd thinks units have failed when, in fact, they're running.  The
> > most common units for this are apache2 and mysql.
> > 
> > This is the state of a recent reboot, where apache2 is listed as failed
> > 
> > bedivere:~# systemctl --failed
> >   UNIT            LOAD   ACTIVE SUB    DESCRIPTION
> > ● apache2.service loaded failed failed LSB: Apache2 web server
> > 
> > LOAD   = Reflects whether the unit definition was properly loaded.
> > ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
> > SUB    = The low-level unit activation state, values depend on unit type.
> > 
> > 1 loaded units listed. Pass --all to see loaded but inactive units, too.
> > To show all installed unit files use 'systemctl list-unit-files'.
> > 
> > Whereas ps shows it to be running
> 
> What's the output of
> systemctl status apache2.service ?

OK, so the machine only has a maintenance window on Saturday.  I
rebooted, ran that, and this is the result:

bedivere:~# systemctl status -l apache2.service
● apache2.service - LSB: Apache2 web server
   Loaded: loaded (/etc/init.d/apache2)
   Active: failed (Result: exit-code) since Sat 2015-10-03 10:45:42 PDT; 1min 53s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 693 ExecStart=/etc/init.d/apache2 start (code=exited, status=1/FAILURE)
   CGroup: /system.slice/apache2.service
           ├─1407 /usr/sbin/apache2 -k start
           ├─1410 PassengerAgent watchdog
           ├─1417 PassengerAgent server
           ├─1422 PassengerAgent logger
           ├─1699 /usr/sbin/apache2 -k start
           ├─1701 /usr/sbin/apache2 -k start
           ├─1702 /usr/sbin/apache2 -k start
           ├─1704 /usr/sbin/apache2 -k start
           └─1706 /usr/sbin/apache2 -k start

Oct 03 10:44:38 bedivere systemd[1]: Starting LSB: Apache2 web server...
Oct 03 10:45:06 bedivere apache2[693]: Starting web server: apache2AH00180: WARNING: MaxRequestWorkers of 20 exceeds ServerLimit value of
Oct 03 10:45:06 bedivere apache2[693]: 5 servers, decreasing MaxRequestWorkers to 5.
Oct 03 10:45:06 bedivere apache2[693]: To increase, please see the ServerLimit directive.
Oct 03 10:45:42 bedivere apache2[693]: failed!
Oct 03 10:45:42 bedivere apache2[693]: The apache2 instance did not start within 20 seconds. Please read the log files to discover problems ... (warning).
Oct 03 10:45:42 bedivere systemd[1]: apache2.service: Control process exited, code=exited status=1
Oct 03 10:45:42 bedivere systemd[1]: Failed to start LSB: Apache2 web server.
Oct 03 10:45:42 bedivere systemd[1]: apache2.service: Unit entered failed state.
Oct 03 10:45:42 bedivere systemd[1]: apache2.service: Failed with result 'exit-code'.

James


Bug#800087: closed by Michael Biebl (Re: Bug#800087: systemd lists running daemons as failed after reboot)

2015-10-04 Thread James Bottomley
On Sun, 2015-10-04 at 23:06 +, Debian Bug Tracking System wrote:
> Well, this is apparently an apache configuration problem, causing the
> service to return a non-zero exit code.
> 
> Once you fix that, the service should not be marked as failed.

That's pretty obviously an incorrect deduction, isn't it?  If it were a
config-file problem causing the control process to error out, then a
manual systemctl start apache2 would also fail; but, as you can see from
the description in the original report, it doesn't.

The logs have this in them:

[ 2015-10-03 10:41:55.6129 1309/b5cfeb40 age/Hel/Main.cpp:455 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown)
[ 2015-10-03 10:41:55.6160 1319/b6960b40 age/Log/Main.cpp:349 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown)
[ 2015-10-03 10:45:21.1489 1353/b5dfeb40 age/Hel/Main.cpp:455 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown)
[ 2015-10-03 10:45:21.1521 1374/b68fab40 age/Log/Main.cpp:349 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown)

So apache dies because something external kills it during boot up.  Hm,
I wonder what that could be ...

James



Bug#793814: Workaround for this bug

2015-10-14 Thread James Bottomley
Since there doesn't seem to have been any progress on this from the
systemd side, and I'm really tired of my email users yelling at me
(justifiably since once the systemd-logind service dies, dovecot sasl
fails to work and they can't send email), this is the script I came up
with to alleviate the problem:

---
#!/bin/bash
##
# shell script to restart systemd-logind when it fails
##
tail -F /var/log/auth.log | while IFS= read -r line; do
    case "$line" in
    *pam_systemd*org.freedesktop.login1*timed\ out*)
        echo "$line"
        systemctl restart systemd-logind.service
        ;;
    esac
done
---

You spawn it from init (redirected to a log) and it monitors the
authentication log for any indication that systemd-logind has dropped
off the bus, then respawns the service.  The effect is that even if a
user hits a SASL authentication failure once, the service will already
be working the next time they try.
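For what it's worth, the match can be pulled out into a small function
and checked against sample lines offline (a sketch; logind_timed_out is
my own name, and the sample line below is synthetic, modelled on the
message format, not copied from a real log):

```shell
#!/bin/sh
# Return 0 if an auth.log line shows pam_systemd timing out against
# the logind dbus endpoint -- the condition the workaround keys on.
logind_timed_out() {
    case "$1" in
    *pam_systemd*org.freedesktop.login1*timed\ out*) return 0 ;;
    *) return 1 ;;
    esac
}

# Synthetic example line (not a real log entry):
if logind_timed_out "sshd[999]: pam_systemd(sshd:session): Failed to create session: Activation of org.freedesktop.login1 timed out"; then
    echo "would restart systemd-logind"
fi
```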

James



Bug#793814: Workaround for this bug

2015-10-14 Thread James Bottomley
On Thu, 2015-10-15 at 00:30 +0200, Michael Biebl wrote:
> Am 15.10.2015 um 00:16 schrieb James Bottomley:
> > Since there doesn't seem to have been any progress on this from the
> > systemd side, and I'm really tired of my email users yelling at me
> > (justifiably since once the systemd-logind service dies, dovecot sasl
> > fails to work and they can't send email), this is the script I came up
> > with to alleviate the problem:
> 
> So are you restarting dbus as well for some reason?

You can't really restart dbus in a systemd system...  However, the
problem seems to be that the endpoint stops listening.  systemd-logind
is still running, but strace -p shows it no longer responds when
something contacts the endpoint.  If you restart it, it will listen for
a while before stopping again.

The script simply detects the dbus timeout and restarts the service.

James




Bug#793814: Workaround for this bug

2015-10-14 Thread James Bottomley
On Thu, 2015-10-15 at 01:06 +0200, Michael Biebl wrote:
> Am 15.10.2015 um 00:34 schrieb James Bottomley:
> > On Thu, 2015-10-15 at 00:30 +0200, Michael Biebl wrote:
> >> Am 15.10.2015 um 00:16 schrieb James Bottomley:
> >>> Since there doesn't seem to have been any progress on this from the
> >>> systemd side, and I'm really tired of my email users yelling at me
> >>> (justifiably since once the systemd-logind service dies, dovecot sasl
> >>> fails to work and they can't send email), this is the script I came up
> >>> with to alleviate the problem:
> >>
> >> So are you restarting dbus as well for some reason?
> > 
> > You can't really restart dbus in a systemd system...  However, the
> > problem seems to be the endpoint stops listening.  systemd-logind is
> > still running but strace -p shows it no longer responds when something
> > contacts the end point.  If you restart it, it will listen for a while
> > before stopping again.
> 
> If in your case you are not restarting dbus, can you isolate what's
> causing logind to stop responding on your system?
> I.e., can you provide us with steps how to reproduce the issue.

It happens naturally in a running system.  This one supplies
SparkleShare sync services for shared and backup directories, so there
are a huge number of micro sshd logins doing syncs, plus SASL via
dovecot for postfix, and dovecot itself.  It's not based on LDAP, so
everything goes via the password-file-based authenticator.

The logs indicate that from reboot to first timeout took about 20 hours.
Thereafter, the timeouts seem to increase in frequency to about two or
three an hour.

James




Bug#793814: Workaround for this bug

2015-10-15 Thread James Bottomley
On Thu, 2015-10-15 at 07:35 +0200, Stefano wrote:
> Same here. I have a git server with clients connecting every 10 minutes or
> so via SSH, and the problem is exactly the same.
> 
> About 20 hours from reboot logind goes nuts, then it must be restarted at
> very variable intervals, about a couple a day.

What architecture is this?  I only see the problems on a 32-bit x86
system.  I don't see them on a 64-bit x86 in spite of trying to trigger
them there (a sample size of 1 of each, so it's not statistically
significant ... yet).

I also see that /run/systemd/sessions has about a hundred or so sessions
whose leaders are dead on the 32-bit system.  On the 64-bit system, only
still-living sessions appear there.

The final curious observation is that on the 32-bit system, the session
id never seems to go over 4k.  On the 64-bit system, it does.
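The dead-leader count is easy to script for comparison across the two
machines (a sketch; it assumes the files in /run/systemd/sessions carry
a LEADER=pid line, which is what I see here, and dead_sessions is my own
name):

```shell
#!/bin/sh
# Count session files whose LEADER pid is no longer alive.
# Directory defaults to systemd's runtime session store.
dead_sessions() {
    dir=${1:-/run/systemd/sessions}
    dead=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        leader=$(sed -n 's/^LEADER=//p' "$f")
        # kill -0 probes for existence without sending a signal
        if [ -n "$leader" ] && ! kill -0 "$leader" 2>/dev/null; then
            dead=$((dead + 1))
        fi
    done
    echo "$dead"
}

dead_sessions    # prints the count for the live system
```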

James



Bug#793814: Workaround for this bug

2015-10-15 Thread James Bottomley
On Thu, 2015-10-15 at 19:45 +0200, Michael Biebl wrote:
> Btw, in both cases (#770135 and #793814) the users have been restarting
> the dbus daemon. You mentioned that you did not do that.
> 
> So I'm not sure if it's actually the same issue after all and your
> problem should be tracked as a separate issue.

I'm starting to think it's a logind robustness problem.  I traced back
through the logs to the first instance after reboot and this is what I
find:

Oct 11 04:17:01 bedivere CRON[2185]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 11 04:17:01 bedivere CRON[2185]: pam_unix(cron:session): session closed for user root
Oct 11 04:17:40 bedivere sshd[2193]: Accepted publickey for clh15 from 184.11.141.41 port 38172 ssh2: DSA SHA256:UK5h9LRMtBthvW0Ncv1SG4WRmSFNs1hPcowPzzyt+iY
Oct 11 04:17:40 bedivere sshd[2193]: pam_unix(sshd:session): session opened for user clh15 by (uid=0)
Oct 11 04:17:40 bedivere systemd-logind[640]: New session 916 of user clh15.
Oct 11 04:17:40 bedivere sshd[2195]: Accepted publickey for clh15 from 184.11.141.41 port 38173 ssh2: DSA SHA256:UK5h9LRMtBthvW0Ncv1SG4WRmSFNs1hPcowPzzyt+iY
Oct 11 04:17:40 bedivere sshd[2195]: pam_unix(sshd:session): session opened for user clh15 by (uid=0)
Oct 11 04:17:40 bedivere systemd-logind[640]: New session 917 of user clh15.
Oct 11 04:17:40 bedivere sshd[2195]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying
Oct 11 04:17:40 bedivere systemd-logind[640]: Failed to abandon session scope: Transport endpoint is not connected
Oct 11 04:17:40 bedivere sshd[2204]: Received disconnect from 184.11.141.41: 11: disconnected by user
Oct 11 04:17:40 bedivere sshd[2204]: Disconnected from 184.11.141.41
Oct 11 04:17:40 bedivere sshd[2193]: pam_unix(sshd:session): session closed for user clh15
Oct 11 04:17:40 bedivere sshd[2203]: Received disconnect from 184.11.141.41: 11: disconnected by user
Oct 11 04:17:40 bedivere sshd[2203]: Disconnected from 184.11.141.41
Oct 11 04:17:40 bedivere sshd[2195]: pam_unix(sshd:session): session closed for user clh15
Oct 11 04:18:05 bedivere sshd[2193]: pam_systemd(sshd:session): Failed to release session: Connection timed out
Oct 11 04:18:05 bedivere systemd-logind[640]: Failed to abandon session scope: Transport endpoint is not connected
Oct 11 04:18:05 bedivere dbus[698]: [system] Failed to activate service 'org.freedesktop.login1': timed out

Thereafter everything fails to activate org.freedesktop.login1.

However, it looks like the trouble is here:

Oct 11 04:17:40 bedivere sshd[2195]: pam_systemd(sshd:session):
Failed to create session: Message recipient disconnected from
message bus without replying

Logind is still active and replies later in the trace, so it looks like
dbus either dropped a message or did some kind of unexpected disconnect.
After this, logind works until it can't abandon the session, then it
never replies on the bus again.  So I suspect that somewhere in logind's
error handling it doesn't cope with the unexpected loss of dbus
messages.
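As a quick way of separating "the logind process is alive" from "logind
is answering on the bus", something like this can be run while the
failure is in progress (a sketch; it assumes busctl is available, as it
is on this systemd version, and check_logind is my own name -- the Ping
call is the generic org.freedesktop.DBus.Peer method, not anything
logind-specific):

```shell
#!/bin/sh
# Probe whether logind answers a trivial D-Bus method call within a
# short deadline, independent of whether the process still exists.
check_logind() {
    deadline=${1:-5}
    if timeout "$deadline" busctl call org.freedesktop.login1 \
            /org/freedesktop/login1 org.freedesktop.DBus.Peer Ping \
            >/dev/null 2>&1; then
        echo "logind replied on the bus"
    else
        echo "logind did not reply within ${deadline}s"
    fi
}

check_logind 5
```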

James




Bug#775578: systemd kills spamassassin on system start

2016-02-11 Thread James Bottomley
On Thu, 2016-02-11 at 23:41 +0100, Michael Biebl wrote:
> Control: tags -1 + moreinfo
> 
> On Sat, 17 Jan 2015 09:08:32 -0800 James Bottomley
>  wrote:
> > Package: systemd
> > Version: 215-8
> > Severity: normal
> > 
> > Almost every time the system reboots, spamassassin fails to start.
> > The systemd logs for this are:
> > 
> > # systemctl status -l spamassassin.service
> > ● spamassassin.service - Perl-based spam filter using text analysis
> >    Loaded: loaded (/lib/systemd/system/spamassassin.service; enabled)
> >    Active: failed (Result: timeout) since Sat 2015-01-17 08:49:04 PST; 3min 45s ago
> >   Process: 528 ExecStart=/usr/sbin/spamd -d --pidfile=/var/run/spamassassin.pid $OPTIONS (code=killed, signal=TERM)
> > 
> > Jan 17 08:48:10 bedivere spamd[528]: logger: removing stderr method
> > Jan 17 08:49:04 bedivere systemd[1]: spamassassin.service start operation timed out. Terminating.
> > Jan 17 08:49:04 bedivere systemd[1]: Failed to start Perl-based spam filter using text analysis.
> > Jan 17 08:49:04 bedivere systemd[1]: Unit spamassassin.service entered failed state.
> > Jan 17 08:49:04 bedivere spamd[748]: spamd: server killed by SIGTERM, shutting down
> > Jan 17 08:49:04 bedivere spamd[748]: spamd: cannot unlink /var/run/spamassassin.pid: No such file or directory
> > 
> > This server is still x86 and a big internet system, so it has lots of
> > intensive processes on start, like fail2ban, clamd and apache.  It
> > looks like because of this, systemd gives spamassassin a few seconds
> > (there's no log of how long; the logger message is from the pre-reboot
> > OS) to start, and it takes longer.
> > 
> > As far as I can tell, this value doesn't seem to be configurable or
> > even package-specific.  It looks remarkably silly for the init system
> > to impose an absolute timeout on service start, particularly when it
> > doesn't take into account the characteristics of the machine or ask
> > the package how long it might reasonably take.
> > 
> > So far, it's only spamassassin, so it's annoying but not serious to
> > have to log in and restart it after every reboot.  However, if systemd
> > did this to a necessary service, it would become a serious bug.
> 
> Can you provide instructions how this issue can be reproduced?

You mean reproduce this outside of booting the system? I have no idea
how to do that.  I suspect it's a load issue, so this is the system:

bedivere:~# cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 2
model name  : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping: 9
microcode   : 0x2e
cpu MHz : 2813.471
cache size  : 512 KB
physical id : 0
siblings: 1
core id : 0
cpu cores   : 1
apicid  : 0
initial apicid  : 0
fdiv_bug: no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
pebs bts cid xtpr
bugs:
bogomips: 5626.94
clflush size: 64
cache_alignment : 128
address sizes   : 36 bits physical, 32 bits virtual
power management:

bedivere:~# cat /proc/meminfo 
MemTotal:1022016 kB
MemFree:   43816 kB
MemAvailable: 346924 kB
Buffers:   36980 kB
Cached:   295580 kB
SwapCached:39344 kB
Active:   528756 kB
Inactive: 395168 kB
Active(anon): 321488 kB
Inactive(anon):   306540 kB
Active(file): 207268 kB
Inactive(file):88628 kB
Unevictable:   0 kB
Mlocked:   0 kB
HighTotal:131016 kB
HighFree:   7628 kB
LowTotal: 891000 kB
LowFree:   36188 kB
SwapTotal:   1951892 kB
SwapFree:1582008 kB
Dirty:   120 kB
Writeback: 0 kB
AnonPages:568992 kB
Mapped:54324 kB
Shmem: 36664 kB
Slab:  36280 kB
SReclaimable:  21972 kB
SUnreclaim:14308 kB
KernelStack:3096 kB
PageTables: 4420 kB
NFS_Unstable:  0 kB
Bounce:0 kB
WritebackTmp:  0 kB
CommitLimit: 2462900 kB
Committed_AS:1973536 kB
VmallocTotal: 122880 kB
VmallocUsed:   11120 kB
VmallocChunk: 103032 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   4096 kB
DirectMap4k: 

Bug#793814: Workaround for this bug

2016-05-11 Thread James Bottomley
On Wed, 2016-05-11 at 12:06 +0200, Michael Biebl wrote:
> Hi James
> 
> On Thu, 15 Oct 2015 10:55:05 -0700 James Bottomley
>  wrote:
> > On Thu, 2015-10-15 at 19:45 +0200, Michael Biebl wrote:
> > > Btw, in both cases (#770135 and #793814) the users have been
> > > restarting the dbus daemon. You mentioned that you did not do that.
> > > 
> > > So I'm not sure if it's actually the same issue after all and your
> > > problem should be tracked as a separate issue.
> > 
> > I'm starting to think it's a logind robustness problem.  I traced back
> > through the logs to the first instance after reboot and this is what I
> > find:
> 
> [...]
> 
> > Logind is still active and replies later in the trace, so it looks
> > like dbus either dropped a message or did some type of unexpected
> > disconnect.  After this, logind works until it can't abandon the
> > session, then it never replies on the bus again.  So I suspect
> > somewhere in the error handling inside logind it doesn't cope with
> > unexpected loss of dbus messages.
> 
> We recently had this upstream bug report, which looks like it could be
> relevant:
> https://github.com/systemd/systemd/issues/1961
> 
> Do you see session scopes piling up on your system?

bedivere:~# ls -ld /run/systemd/system/session-*|wc -l
141

So apparently, yes.
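To watch that number over time, the one-liner can be wrapped into a
function taking the directory as an argument (a sketch;
count_session_scopes is my own name, and the default path is the one
from the command above):

```shell
#!/bin/sh
# Count leftover session-* entries under systemd's runtime unit
# directory -- the same number the ls|wc pipeline above produces.
count_session_scopes() {
    dir=${1:-/run/systemd/system}
    # ls is safe here: systemd's session-* names contain no newlines
    ls -d "$dir"/session-* 2>/dev/null | wc -l
}

count_session_scopes
```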
