On Fri, 2013-11-29 at 23:13 +0100, Gino Semseo wrote: > 2013-11-29 20:29:01 1VmTkL-0002F8-5Q spam acl condition: cannot parse spamd > output
> Nov 29 16:00:55 hyperserver spamd[38925]: prefork: ordered 38942 to accept > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: sysread(7) not ready, wait > max 300 secs Why does sysread() fail? That's the parent process, so reading from the child failed in M::SA::SpamdForkScaling sysread_with_timeout(), and we're falling back to blocking I/O. > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: child 38942: entering > state 2 > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: new lowest idle kid: 38944 The child we just ordered to process a message now is busy (state 2). Following lines are logged by the child. > Nov 29 16:00:55 hyperserver spamd[38942]: spamd: connection from localhost > [127.0.0.1] at port 44846 > Nov 29 16:00:55 hyperserver spamd[38942]: spamd: setuid to diversia succeeded > Nov 29 16:00:55 hyperserver spamd[38942]: config: read_scoreonly_config: > cannot open "/home/diversia/.spamassassin/user_prefs": No such file or > directory > Nov 29 16:00:55 hyperserver spamd[38942]: info: user has changed > Nov 29 16:00:55 hyperserver spamd[38942]: bayes: DB journal sync: last sync: > 1385736975 Last Bayes journal sync is just 5 minutes ago, Nov 29 15:56:15. That's not necessarily bad, but seems a little odd given your configuration of learning to journal: bayes_learn_to_journal 1 bayes_journal_max_size 100000 The journal max size probably is not what you expect it to be. That custom setting is hardly sufficiently different from the default 102400, to even care about it. > Nov 29 16:00:55 hyperserver spamd[38942]: config: score set 3 chosen. > Nov 29 16:00:55 hyperserver spamd[38942]: spamd: running as uid 574 > Nov 29 16:00:55 hyperserver spamd[38942]: config: time limit 300.0 s > Nov 29 16:00:55 hyperserver spamd[38942]: message: main message type: > multipart/mixed The child obviously is processing a message, and the child's last few logged debug messages do not indicate any issue. Next is the parent noticing its child's death. > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: child closed connection > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: child states: BIIIII > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: child 38942: just exited > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: child 38942: entering > state 4 State 4, parent received SIGCHL. > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: new lowest idle kid: 38944 > Nov 29 16:00:55 hyperserver spamd[38925]: spamd: handled cleanup of child pid > [38942] due to SIGCHLD: exit 1 > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: new lowest idle kid: 38944 > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: select returned err > Interrupted system call, probably signalled That would be due to the above sysread() failing. We backed down to using select() and timeouts. Wait, the code in question does ring some bells with the full thread in mind. That debug message is unique in the SA code, and comes with a nice explanation in M::SA::SpamdForkScaling: if (exists &Errno::EINTR && $selerr == &Errno::EINTR) { # this happens if the process is signalled during the select(), # for example if someone sends SIGHUP to reload the configuration. # just return inmmediately dbg("prefork: select returned err $selerr, probably signalled"); return; } Indeed, we did have to revert to select(). Might we also have received SIGHUP to reload the configuration? Like, say, a massive whitelist being updated, and us being requested to immediately pull it in? (Thanks Martin Gregorie for pointing out that game of "whitelist changed, reload daemon" procedure.) Gino, are there any processes or cron jobs, that recreate the whitelist more frequently than once a day? How frequently? Do they signal spamd about the configuration change? If so, you owe both me and Martin a beer. ;) > Nov 29 16:00:55 hyperserver spamd[38925]: spamd: server successfully spawned > child process, pid 39240 > Nov 29 16:00:55 hyperserver spamd[38925]: prefork: child 39240: entering > state 0 State 0, a new child is born. Off to business as usual... > If I just execute spamassassin -t -D from shell, most times there is no > error and the message is fully scanned... "MOST times there is no error". Emphasis mine. So you cannot reliably reproduce the issue, and there is no test-case message. It just happens, at random -- or rather at times other processes meddle with your spamd daemon? -- char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4"; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1: (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}