>> What is the output of netstat?

> I will add a 'netstat' command to my 'spamassassin' init script,
> before the spamd command, in order to see what listens to port
> 783 at that time.
> I will post the output of netstat when available.

So did I. And I waited for the same problem to happen again. I am
still using SA 3.0.1.


# /etc/init.d/spamassassin (section "start") :
(...)
case "$1" in
  start)
(...)
/bin/netstat -ltnpv >/var/log/test_netstat.`/bin/date +%Y%m%d%H%M`.log
2>&1
/usr/bin/spamd -c -u spam >/var/log/test_spamd.`/bin/date
+%Y%m%d%H%M`.log 2>&1 &
(...)


# grep 783 /var/log/test_netstat.200501140004.log
tcp   0   0   127.0.0.1:783   0.0.0.0:*   LISTEN   28370/spamd child


# cat /var/log/test_spamd.200501140004.log
Could not create INET socket on 127.0.0.1:783: Address already in use
 (IO::Socket::INET: Address already in use)


Relevant maillog lines (from syslog) :

(my watchdog script, run hourly at HH+30 minutes)
Jan 13 23:30:01 server watchdog[26959]: spamd is running with pids :
 26640 26593 26411 26215 26181

(midnight : /etc/init.d/postfix stop)
Jan 14 00:00:00 server postfix/postfix-script: stopping the Postfix mail
system
Jan 14 00:00:00 server postfix/master[18136]: terminating on signal 15

(sleep 30, then /etc/init.d/spamassassin stop)
Jan 14 00:00:30 server spamd[18069]: server hit by SIGCHLD
Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26640
Jan 14 00:00:30 server spamd[18069]: server successfully spawned child
process, pid 28369
Jan 14 00:00:30 server spamd[18069]: server hit by SIGCHLD
Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26411
Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26593
Jan 14 00:00:30 server spamd[18069]: server killed by SIGTERM, shutting
down
Jan 14 00:00:30 server spamd[18069]: server hit by SIGCHLD
Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26181
Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26215

(sa-learn --force-expire, takes several minutes, then
/etc/init.d/spamassassin start)
(this spamd process will immediately die, saying Address already in use)
(A process named "spamd child", with pid 28370 is indeed listening to
that port)
Jan 14 00:04:12 server spamd[28387]: spamd starting

(sleep 30, then /etc/init.d/postfix start)
Jan 14 00:04:42 server postfix/postfix-script: starting the Postfix mail
system
Jan 14 00:04:42 server postfix/master[28449]: daemon started -- version
2.1.5

(several mails get processed by SA, all by the same spamd pid : 28370)
Jan 14 00:04:50 server spamd[28370]: connection from server [127.0.0.1]
at port 38402
Jan 14 00:04:50 server spamd[28370]: processing message ...
Jan 14 00:05:01 server spamd[28370]: identified spam ...
(... Other good behaving spamd lines ...)

(several mails and minutes later, spamc/spamd run into problems)
Jan 14 00:21:04 server spamc[29119]: connect(AF_INET) to spamd at
127.0.0.1 failed, retrying (#1 of 3): Connection refused
(... Other missbehaving spamc lines ...)

(my watchdog script, run hourly at HH+30 minutes)
(no spamd is running, and a spamassassin start is issued, which
effectively starts spamd)
(this time, no process is listening on port 783 according to a second
netstat output)
Jan 14 00:30:00 server watchdog[29404]: spamd is not running : starting
it
Jan 14 00:30:00 server spamd[29410]: spamd starting
Jan 14 00:30:02 server spamd[29410]: server started on port 783/tcp
(running version 3.0.1)
Jan 14 00:30:02 server spamd[29410]: server successfully spawned child
process, pid 29426
Jan 14 00:30:02 server spamd[29410]: server successfully spawned child
process, pid 29427
Jan 14 00:30:02 server spamd[29410]: server successfully spawned child
process, pid 29428
Jan 14 00:30:02 server spamd[29410]: server successfully spawned child
process, pid 29429
Jan 14 00:30:02 server spamd[29410]: server successfully spawned child
process, pid 29430

------------------------------------------------------------------------
In my case, 30 seconds after "postfix stop" (that is 30 seconds after
midnight), the command "spamassassin stop" is started. This
"spamassassin" script comes from RedHat 9 and issues a "killproc spamd".
"killproc" is defined in the "functions" script, also provided by RedHat
9. This might issue a command like "kill -TERM" on "spamd" pid but also
on "spamd child" pids too (I am not that familiar with shell scripts).

I suspect this SIGTERM could hit some "spamd child" first, before
hitting the parent spamd. In some cases, the parent spamd could have
enough time (before being hit itself by SIGTERM) to receive a SIGCHLD,
and to try to respawn a child (as it normally does, when hit by
SIGCHLD).

Reading my maillog file, it seems to me that all the 5 children pids
seen by my watchdog script produce a line like "handled cleanup of child
pid ...". But the parent spamd process has got enough time to spawn a
new child with pid 28369, which seems to not get "cleanup handled". Some
process (maybe this one 28369) start a new "spamd child" process, with
pid 28370. This 28370 "spamd child" process listens on port 783 which
seems strange, because in a normal situation, netstat reports "perl" as
the listening process on port 783, not "spamd child".


Maybe I should just change the /etc/init.d/spamassassin script to issue
a "kill -TERM" only on the parent spamd process, not on all children, in
order to see if it helps ?


Maybe all my explanation is plain crap.

Many thanks to anyone helping me to understand what is going on.


Reply via email to