Everything is fine now, so no stress about anything here, but I am
poking at the log files with a stick after a strange incident.
Perhaps it tripped over some problem, and discussing it might either
enlighten me or, less likely, improve things. Who knows?
The GNU Savannah software forge had a network outage in the data
center lasting about eight hours. It happened in the dead of night,
and things were fixed quickly once the admins woke at sunrise and
went to the data center. Due to that timing, it was an unusually long
network outage.
I would like to describe four of the VMs of interest here. Two were
okay after networking returned, but two were found afterward with
postfix not running. I am curious whether the why is somehow useful
or interesting to know.
All of the systems have their root block storage on a Ceph
network-attached storage pool, which of course meant that the root
file system was unavailable for the full eight hours of the outage.
If some bit of file data is cached and not yet expired, the Linux
kernel can service a read from the page cache. If not, and the data
actually needs to be read, the kernel attempts a network read and of
course blocks waiting for network I/O.
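As an aside, processes stuck that way sit in uninterruptible sleep,
the "D" state, and on Linux each one also counts toward the load
average. A quick way to spot them (hypothetical illustration, not a
capture from the incident):

```shell
# List processes in uninterruptible sleep (state "D"), the state that
# I/O-blocked cron jobs pile up in; on Linux each one also adds 1 to
# the load average.
ps -eo stat=,pid=,comm= | awk '$1 ~ /^D/ { print; n++ } END { print n+0, "process(es) in D state" }'
```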
Of course cron jobs were still running, and processes blocked waiting
on I/O kept stacking up. One server reached a load average of 520 and
recovered perfectly fine after networking was restored. Another
reached a load of 68, but afterward the postfix daemons were found
not running. In summary: two of the four had no discernible failures,
and two of the four were found with postfix not running. Postfix
seems to have been the only failure noticed, which I found rather
unusual, and perhaps noteworthy. But this is a very unusual
situation, with the root file system unavailable for an extended
period of time.
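For reference, "not running" here means the master daemon was gone.
Postfix's own status subcommand exits nonzero in that case, so a
check along these lines (hypothetical, not something that was
actually in place) would flag it:

```shell
# Report whether the postfix master daemon is alive; "postfix status"
# exits nonzero when it is not (or when postfix is not installed).
if postfix status >/dev/null 2>&1; then
    echo "postfix master: running"
else
    echo "postfix master: NOT running"
fi
```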
I am simply reviewing things after the fact, trying to understand and
perhaps improve things, since two of the four failed. But again,
there is no stress here. Everything is all good now, and this was a
highly unusual system event. Because those two were found without
postfix running, I subsequently rebooted all of the servers as a
preventative maintenance action, even though the others seemed
perfectly okay, since almost certainly there would be other problems
not yet found.
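One improvement I am considering, sketched here under the assumption
that postfix on these hosts is managed by a native systemd unit (and
not a SysV init wrapper, where this would not apply): a drop-in
telling systemd to restart the service if it fails, e.g.:

```
# /etc/systemd/system/postfix.service.d/override.conf
# (hypothetical drop-in; assumes a native postfix.service unit)
[Service]
Restart=on-failure
RestartSec=30
```

followed by a systemctl daemon-reload. Whether that would have helped
here depends on whether master actually exited with a failure status
rather than being stopped cleanly.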
Any ideas on why postfix would not be running after such an event on
two of the systems but okay on the others?
Bob
I am abbreviating everything because the first sending of this
message was too large for the mailing list. :-)
Here is one of the config files from the two where postfix was found
not running. I'll put the long details below. The other two are
similar. This is on an older Trisquel release. Trisquel is a fork of
Ubuntu, so basically think Ubuntu here.
root@vcs1:~# postconf -n
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
append_dot_mydomain = no
biff = no
canonical_maps = hash:/etc/postfix/canonical
compatibility_level = 2
inet_interfaces = loopback-only
inet_protocols = ipv4
mailbox_size_limit = 0
masquerade_domains = savannah.gnu.org
masquerade_exceptions = root
mydestination = $myhostname, localhost.$mydomain, localhost
myhostname = vcs1.savannah.gnu.org
mynetworks = 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128
myorigin = /etc/mailname
non_smtpd_milters = unix:/var/run/opendkim/opendkim.sock
readme_directory = no
recipient_delimiter = +
relayhost =
smtp_tls_session_cache_database = btree:${data_directory}/smtp_scache
smtpd_banner = $myhostname ESMTP $mail_name (Debian/GNU)
smtpd_milters = unix:/var/run/opendkim/opendkim.sock
smtpd_relay_restrictions = permit_mynetworks permit_sasl_authenticated
defer_unauth_destination
smtpd_tls_cert_file = /etc/ssl/certs/ssl-cert-snakeoil.pem
smtpd_tls_key_file = /etc/ssl/private/ssl-cert-snakeoil.key
smtpd_tls_session_cache_database = btree:${data_directory}/smtpd_scache
smtpd_use_tls = yes
And here is the smaller of the log excerpts. The timezone is US/Eastern for both.
...everything is fine...
Dec 20 01:45:45 vcs1 postfix/qmgr[1996]: A25F0A908C: removed
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event:
read timeout for service public/pickup
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event:
read timeout for service public/pickup
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event:
read timeout for service public/pickup
...I trimmed out 27 identical lines just to reduce this...
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event:
read timeout for service public/pickup
Dec 20 09:43:32 vcs1 postfix/master[1983]: warning: unix_trigger_event:
read timeout for service public/pickup
Dec 20 09:43:32 vcs1 postfix/master[1983]: warning: unix_trigger_event:
read timeout for service public/pickup
Dec 20 09:43:32 vcs1 postfix/postfix-script[30251]: fatal: the Postfix mail
system is not running
...I really rather arbitrarily trimmed this for mailing list size...
Dec 20 03:01:31 frontend1 postfix/qmgr[10024]: warning: problem talking to
service rewrite: Connection timed out
Dec 20 03:01:58 frontend1 postfix/master[10022]: warning:
unix_trigger_event: read timeout for service private/tlsmgr
Dec 20 03:02:13 frontend1 postfix/master[10022]: warning:
unix_trigger_event: read timeout for service public/qmgr
...
Dec 20 04:51:34 frontend1 postfix/master[10022]: warning:
unix_trigger_event: read timeout for service public/pickup
Dec 20 04:51:53 frontend1 postfix/master[10022]: warning:
master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily
unavailable
Dec 20 04:52:14 frontend1 postfix/master[10022]: warning:
unix_trigger_event: read timeout for service public/qmgr
Dec 20 04:52:34 frontend1 postfix/master[10022]: warning:
unix_trigger_event: read timeout for service public/pickup
Dec 20 04:52:54 frontend1 postfix/master[10022]: warning:
master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily
unavailable
...
Dec 20 06:46:58 frontend1 postfix/master[10022]: warning:
master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily
unavailable
Dec 20 06:47:16 frontend1 postfix/master[10022]: warning:
unix_trigger_event: read timeout for service public/qmgr
Dec 20 09:43:22 frontend1 postfix/qmgr[17568]: CA6EE20D2C:
from=<[email protected]>, size=1479, nrcpt=1 (queue active)
Dec 20 09:43:22 frontend1 postfix/qmgr[17568]: warning: connect #1 to
subsystem private/rewrite: Connection refused
Dec 20 09:43:32 frontend1 postfix/qmgr[17568]: warning: connect #2 to
subsystem private/rewrite: Connection refused
Dec 20 09:43:42 frontend1 postfix/qmgr[17568]: warning: connect #3 to
subsystem private/rewrite: Connection refused
Dec 20 09:43:52 frontend1 postfix/qmgr[17568]: warning: connect #4 to
subsystem private/rewrite: Connection refused
Dec 20 09:44:02 frontend1 postfix/qmgr[17568]: warning: connect #5 to
subsystem private/rewrite: Connection refused
Dec 20 09:44:12 frontend1 postfix/qmgr[17568]: warning: connect #6 to
subsystem private/rewrite: Connection refused
Dec 20 09:44:22 frontend1 postfix/qmgr[17568]: warning: connect #7 to
subsystem private/rewrite: Connection refused
Dec 20 09:44:32 frontend1 postfix/qmgr[17568]: warning: connect #8 to
subsystem private/rewrite: Connection refused
Dec 20 09:44:42 frontend1 postfix/qmgr[17568]: warning: connect #9 to
subsystem private/rewrite: Connection refused
Dec 20 09:44:52 frontend1 postfix/qmgr[17568]: warning: connect #10 to
subsystem private/rewrite: Connection refused
Dec 20 09:45:02 frontend1 postfix/qmgr[17568]: fatal: connect #11 to
subsystem private/rewrite: Connection refused