>> What is the output of netstat? > I will add a 'netstat' command to my 'spamassassin' init script, > before the spamd command, in order to see what listens to port > 783 at that time. > I will post the output of netstat when available.
So did I. And I waited for the same problem to happen again. I am still using SA 3.0.1. # /etc/init.d/spamassassin (section "start") : (...) case "$1" in start) (...) /bin/netstat -ltnpv >/var/log/test_netstat.`/bin/date +%Y%m%d%H%M`.log 2>&1 /usr/bin/spamd -c -u spam >/var/log/test_spamd.`/bin/date +%Y%m%d%H%M`.log 2>&1 & (...) # grep 783 /var/log/test_netstat.200501140004.log tcp 0 0 127.0.0.1:783 0.0.0.0:* LISTEN 28370/spamd child # cat /var/log/test_spamd.200501140004.log Could not create INET socket on 127.0.0.1:783: Address already in use (IO::Socket::INET: Address already in use) Relevant maillog lines (from syslog) : (my watchdog script, run hourly at HH+30 minutes) Jan 13 23:30:01 server watchdog[26959]: spamd is running with pids : 26640 26593 26411 26215 26181 (midnight : /etc/init.d/postfix stop) Jan 14 00:00:00 server postfix/postfix-script: stopping the Postfix mail system Jan 14 00:00:00 server postfix/master[18136]: terminating on signal 15 (sleep 30, then /etc/init.d/spamassassin stop) Jan 14 00:00:30 server spamd[18069]: server hit by SIGCHLD Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26640 Jan 14 00:00:30 server spamd[18069]: server successfully spawned child process, pid 28369 Jan 14 00:00:30 server spamd[18069]: server hit by SIGCHLD Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26411 Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26593 Jan 14 00:00:30 server spamd[18069]: server killed by SIGTERM, shutting down Jan 14 00:00:30 server spamd[18069]: server hit by SIGCHLD Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26181 Jan 14 00:00:30 server spamd[18069]: handled cleanup of child pid 26215 (sa-learn --force-expire, takes several minutes, then /etc/init.d/spamassassin start) (this spamd process will immediately die, saying Address already in use) (A process named "spamd child", with pid 28370 is indeed listening to that port) Jan 14 00:04:12 server spamd[28387]: spamd starting (sleep 30, then /etc/init.d/postfix start) Jan 14 00:04:42 server postfix/postfix-script: starting the Postfix mail system Jan 14 00:04:42 server postfix/master[28449]: daemon started -- version 2.1.5 (several mails get processed by SA, all by the same spamd pid : 28370) Jan 14 00:04:50 server spamd[28370]: connection from server [127.0.0.1] at port 38402 Jan 14 00:04:50 server spamd[28370]: processing message ... Jan 14 00:05:01 server spamd[28370]: identified spam ... (... Other good behaving spamd lines ...) (several mails and minutes later, spamc/spamd run into problems) Jan 14 00:21:04 server spamc[29119]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused (... Other missbehaving spamc lines ...) (my watchdog script, run hourly at HH+30 minutes) (no spamd is running, and a spamassassin start is issued, which effectively starts spamd) (this time, no process is listening on port 783 according to a second netstat output) Jan 14 00:30:00 server watchdog[29404]: spamd is not running : starting it Jan 14 00:30:00 server spamd[29410]: spamd starting Jan 14 00:30:02 server spamd[29410]: server started on port 783/tcp (running version 3.0.1) Jan 14 00:30:02 server spamd[29410]: server successfully spawned child process, pid 29426 Jan 14 00:30:02 server spamd[29410]: server successfully spawned child process, pid 29427 Jan 14 00:30:02 server spamd[29410]: server successfully spawned child process, pid 29428 Jan 14 00:30:02 server spamd[29410]: server successfully spawned child process, pid 29429 Jan 14 00:30:02 server spamd[29410]: server successfully spawned child process, pid 29430 ------------------------------------------------------------------------ In my case, 30 seconds after "postfix stop" (that is 30 seconds after midnight), the command "spamassassin stop" is started. This "spamassassin" script comes from RedHat 9 and issues a "killproc spamd". "killproc" is defined in the "functions" script, also provided by RedHat 9. This might issue a command like "kill -TERM" on "spamd" pid but also on "spamd child" pids too (I am not that familiar with shell scripts). I suspect this SIGTERM could hit some "spamd child" first, before hitting the parent spamd. In some cases, the parent spamd could have enough time (before being hit itself by SIGTERM) to receive a SIGCHLD, and to try to respawn a child (as it normally does, when hit by SIGCHLD). Reading my maillog file, it seems to me that all the 5 children pids seen by my watchdog script produce a line like "handled cleanup of child pid ...". But the parent spamd process has got enough time to spawn a new child with pid 28369, which seems to not get "cleanup handled". Some process (maybe this one 28369) start a new "spamd child" process, with pid 28370. This 28370 "spamd child" process listens on port 783 which seems strange, because in a normal situation, netstat reports "perl" as the listening process on port 783, not "spamd child". Maybe I should just change the /etc/init.d/spamassassin script to issue a "kill -TERM" only on the parent spamd process, not on all children, in order to see if it helps ? Maybe all my explanation is plain crap. Many thanks to anyone helping me to understand what is going on.