On Tue, Feb 23, 2010 at 12:35:22PM +1100, John Marshall wrote:
> Environment: sendmail 8.14.4 on FreeBSD 8.0-RELEASE-p2
>
> Since upgrading a few local servers to FreeBSD 8.0-RELEASE (and
> subsequently 8.0-RELEASE-p2), I have been seeing VERY intermittent
> problems with sendmail persistent queue runners.  One or more queue
> runners will fail to wake up (having been told to sleep for either 1 or
> 5 seconds) and mail accumulates in their queue group queues.
>
> I have only seen this about 4 times but at least once on each of the
> three 8.0 servers.  I've been seeing something like one occurrence per
> fortnight overall.  The first few times I re-started sendmail.  On
> Saturday I spent longer looking at it.
>
>  - attached to each of the stuck queue runner processes via gdb to
>    try to see where they were stuck
>  - backtraces from both process were identical and looked sane
>  - attached to a happy queue runner process and got an identical
>    backtrace
>  - exited gdb and discovered that the stuck queue runners had woken
>    up and flushed their queues!
>
> The stuck queue runner processes had been stuck for several hours
> (judging by the timestamps on the queued mail messages) but the gdb
> attach apparently woke them up!
>
> PROCESS STATES BEFORE DEBUG (stuck runners are in 'I' state)
>
>   PID  TT  STAT      TIME COMMAND
> 80298  ??  Ss     0:17.68 sendmail: accepting connections (sendmail)
> 80299  ??  I      0:46.62 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail)
> 80300  ??  I      0:08.83 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail)
> 80301  ??  S      0:31.58 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail)
> 80302  ??  S      0:30.71 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail)
> 80303  ??  S      0:33.29 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail)
> 80304  ??  S      0:30.55 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail)
>
> BACKTRACE OF STUCK PROCESS 80299
>
> (gdb) bt
> #0  0x28346547 in sigsuspend () from /lib/libc.so.7
> #1  0x28344e98 in sigpause () from /lib/libc.so.7
> #2  0x2833be3e in pause () from /lib/libc.so.7
> #3  0x080cc7c8 in sleep ()
> #4  0x08099c51 in run_work_group ()
> #5  0x08099ebf in runqueue ()
> #6  0x0805538d in main ()
>
> BACKTRACE OF HAPPY PROCESS 80301
>
> (gdb) bt
> #0  0x28346547 in sigsuspend () from /lib/libc.so.7
> #1  0x28344e98 in sigpause () from /lib/libc.so.7
> #2  0x2833be3e in pause () from /lib/libc.so.7
> #3  0x080cc7c8 in sleep ()
> #4  0x08099c51 in run_work_group ()
> #5  0x08099ebf in runqueue ()
> #6  0x0805538d in main ()
>
> PROCESS STATES AFTER DEBUG
>
>   PID  TT  STAT      TIME COMMAND
> 80298  ??  Ss     0:17.69 sendmail: accepting connections (sendmail)
> 80299  ??  S      0:46.66 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail)
> 80300  ??  S      0:08.85 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail)
> 80301  ??  S      0:31.60 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail)
> 80302  ??  S      0:30.73 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail)
> 80303  ??  S      0:33.32 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail)
> 80304  ??  S      0:30.58 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail)
>
> SENDMAIL DETAILS
>
> Version 8.14.4
>  Compiled with: DNSMAP LOG MAP_REGEX MATCHGECOS MILTER MIME7TO8 MIME8TO7
>                 NAMED_BIND NETINET NETUNIX NEWDB NIS PIPELINING SASLv2 SCANF
>                 STARTTLS USERDB XDEBUG
>
> /usr/sbin/sendmail:
>         libsasl2.so.2 => /usr/local/lib/libsasl2.so.2 (0x28154000)
>         libssl.so.7 => /usr/local/lib/libssl.so.7 (0x2816a000)
>         libcrypto.so.7 => /usr/local/lib/libcrypto.so.7 (0x281ad000)
>         libutil.so.8 => /lib/libutil.so.8 (0x282f2000)
>         libc.so.7 => /lib/libc.so.7 (0x28300000)
>         libz.so.5 => /lib/libz.so.5 (0x2840c000)
>
> I posted about this in comp.mail.sendmail and was told...
>
> > sleep() should be one of these calls:
> >
> >     if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME)
> >         sleep(MIN_SLEEP_TIME);
> >     else if (WorkGrp[wgrp].wg_lowqintvl <= 0)
> >         sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME);
> >     else
> >         sleep(WorkGrp[wgrp].wg_lowqintvl);
> >
> > Unless you have a really large value for one of these, the process
> > should continue after a while.
>
> The above code snippet is from sendmail/queue.c which fixes
> MIN_SLEEP_TIME at 5.  QueueIntvl defaults to 1.  wg_lowqintvl defaults
> to 0.  I have not set any configuration or runtime options to override
> these defaults, so my persistent queue runners should be sleeping for
> either 1s or 5s only (not hours!).

I think the best way to collect the data would be to ktrace the queue runners, preferably starting the ktrace before they get stuck.