Hello,

Using: spamassassin 3.2.5 on a CentOS 5.2 system.

Unfortunately the spamd process on one of our mail servers crashed early
this morning. The system mail log showed:

==========================================================
Jan 31 06:52:00 tracy spamd[23255]: spamd: connection from
localhost.localdomain [127.0.0.1] at port 45028
Jan 31 06:52:13 tracy spamd[2347]: spamd: server killed by SIGTERM,
shutting down
Jan 31 06:52:24 tracy spamd[26043]: server socket setup failed, retry 1:
spamd: could not create INET socket on 127.0.0.1:783: Address already in
use
Jan 31 06:52:25 tracy spamd[23255]: spamd: checking message
<200901310651.n0v6pxad026...@isg-prod-loader.informa.com> for
sauser:10001
Jan 31 06:52:25 tracy spamd[26043]: server socket setup failed, retry 2:
spamd: could not create INET socket on 127.0.0.1:783: Address already in
use
Jan 31 06:52:26 tracy spamd[26043]: spamd: could not create INET socket
on 127.0.0.1:783: Address already in use
Jan 31 06:52:31 tracy spamd[23255]: spamd: clean message (-6.6/8.0) for
sauser:10001 in 30.9 seconds, 5194 bytes.
Jan 31 06:52:31 tracy spamd[23255]: spamd: result: . -6 -
BAYES_00,RCVD_IN_DNSWL_MED scantime=30.9,size=5194,user=sauser,uid=10001,required_score=8.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=45028,mid=<200901310651.n0v6pxad026...@isg-prod-loader.informa.com>,bayes=0.000000,autolearn=ham
Jan 31 06:52:31 tracy spamd[23255]: syswrite() to parent failed: Broken
pipe
at /usr/lib/perl5/vendor_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 
576.
==========================================================


My first thought was a bug in the SpamdForkScaling.pm module, but I'm
not so sure.

At 06:52 spamd was fine, but we have an sa-update/sa-compile job that
runs at around that time. The files in /var/lib/spamassassin/compiled
indicate that the job was running (or finishing) at 06:52. The job (if
successful) then restarts spamassassin (using 'service spamassassin
restart').

Now, the above log shows that at 06:52:13 the spamd parent (PID 2347)
received a shutdown signal, which is correct when restarting. But at
06:52:24 a new spamd (PID 26043) tried to start up and could not, because
SA was still running (the port was in use). Then at 06:52:31 the scan
being handled by child 23255 finished; by then the parent process was
gone, hence the syswrite error.

Okay, so looking at the SA startup script it shows (this is within a
shell 'case' statement):

==========================================================
  stop)                                                    
        # Stop daemons.                                    
        echo -n $"Stopping $prog: "                        
        killproc spamd                                     
        RETVAL=$?                                          
        echo                                               
        if [ $RETVAL = 0 ]; then
                rm -f /var/lock/subsys/spamassassin
                rm -f $SPAMD_PID
        fi
        ;;
  restart)
        $0 stop
        sleep 3
        $0 start
        ;;
==========================================================


I suspect the problem is that the 'stop' actually failed (RETVAL != 0).
But since the 'restart' doesn't check this, it then just went on and
tried to 'start' SA. This failed because SA still had a process/child
running. Ultimately it meant that our mail server ended up with SA not
running.

Perhaps the Red Hat startup script (and hence, I assume, the Fedora and
CentOS ones) should be a bit more aggressive in checking that SA has
actually stopped before trying to start it again? I would rather that
more time was spent on ensuring that SA had stopped, so that it could
then start, than have the restart fail completely and leave the server
without SA running.
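
To illustrate the kind of check I mean, here is a minimal sketch, not a
tested patch. The names wait_until_gone and restart_checked are
hypothetical and not part of the stock script, the 60 second timeout is
arbitrary, and I am assuming that 'pgrep -x spamd' matches the spamd
parent and all of its children on this system, which would need
verifying.

```shell
#!/bin/sh
# Sketch only: both functions below are hypothetical helpers, not part
# of the stock spamassassin init script.

# Run the given command once a second until it fails, i.e. until the
# thing it checks for is gone. Gives up after $1 seconds.
# Returns 0 if the command eventually failed, 1 on timeout.
wait_until_gone() {
    timeout=$1
    shift
    waited=0
    while "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible body for the 'restart)' branch, replacing the fixed 'sleep 3'.
restart_checked() {
    "$0" stop
    # Assumption: 'pgrep -x spamd' matches the parent and every child.
    if ! wait_until_gone 60 pgrep -x spamd; then
        echo "spamd still running after 60 seconds, not starting" >&2
        return 1
    fi
    "$0" start
}
```

The 'restart)' branch would then just call restart_checked. The scan in
the log above took 30.9 seconds, so the fixed 3 second sleep had no
chance of covering it. If spamd has still not gone at the timeout, the
sketch at least fails with a message and a non-zero exit status that the
sa-update job can act on, rather than attempting a start that cannot
bind the port.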




John.

-- 
---------------------------------------------------------------
John Horne, University of Plymouth, UK  Tel: +44 (0)1752 587287
E-mail: john.ho...@plymouth.ac.uk       Fax: +44 (0)1752 587001
