On Tue, Feb 23, 2010 at 12:35:22PM +1100, John Marshall wrote:
> Environment: sendmail 8.14.4 on FreeBSD 8.0-RELEASE-p2
> 
> Since upgrading a few local servers to FreeBSD 8.0-RELEASE (and
> subsequently 8.0-RELEASE-p2), I have been seeing VERY intermittent
> problems with sendmail persistent queue runners.  One or more queue
> runners will fail to wake up (having been told to sleep for either 1 or
> 5 seconds) and mail accumulates in their queue group queues.
> 
> I have only seen this about 4 times but at least once on each of the
> three 8.0 servers.  I've been seeing something like one occurrence per
> fortnight overall.  The first few times I re-started sendmail.  On
> Saturday I spent longer looking at it.
> 
>  - attached to each of the stuck queue runner processes via gdb to
>    try to see where they were stuck
>  - backtraces from both process were identical and looked sane
>  - attached to a happy queue runner process and got an identical
>    backtrace
>  - exited gdb and discovered that the stuck queue runners had woken
>    up and flushed their queues!
> 
> The stuck queue runner processes had been stuck for several hours
> (judging by the timestamps on the queued mail messages) but the gdb
> attach apparently woke them up!
> 
> PROCESS STATES BEFORE DEBUG (stuck runners are in 'I' state)
> 
>   PID  TT  STAT      TIME COMMAND
> 80298  ??  Ss     0:17.68 sendmail: accepting connections (sendmail)
> 80299  ??  I      0:46.62 sendmail: running queue: /var/spool/mqueue/qd1/df 
> (sendmail)
> 80300  ??  I      0:08.83 sendmail: running queue: /var/spool/mqueue/mby/df 
> (sendmail)
> 80301  ??  S      0:31.58 sendmail: running queue: /var/spool/mqueue/oz/df 
> (sendmail)
> 80302  ??  S      0:30.71 sendmail: running queue: /var/spool/mqueue/rw2/df 
> (sendmail)
> 80303  ??  S      0:33.29 sendmail: running queue: /var/spool/mqueue/hold/df 
> (sendmail)
> 80304  ??  S      0:30.55 sendmail: running queue: /var/spool/mqueue/pgp/df 
> (sendmail)
> 
> BACKTRACE OF STUCK PROCESS 80299
> 
> (gdb) bt
> #0  0x28346547 in sigsuspend () from /lib/libc.so.7
> #1  0x28344e98 in sigpause () from /lib/libc.so.7
> #2  0x2833be3e in pause () from /lib/libc.so.7
> #3  0x080cc7c8 in sleep ()
> #4  0x08099c51 in run_work_group ()
> #5  0x08099ebf in runqueue ()
> #6  0x0805538d in main ()
> 
> BACKTRACE OF HAPPY PROCESS 80301
> 
> (gdb) bt
> #0  0x28346547 in sigsuspend () from /lib/libc.so.7
> #1  0x28344e98 in sigpause () from /lib/libc.so.7
> #2  0x2833be3e in pause () from /lib/libc.so.7
> #3  0x080cc7c8 in sleep ()
> #4  0x08099c51 in run_work_group ()
> #5  0x08099ebf in runqueue ()
> #6  0x0805538d in main ()
> 
> PROCESS STATES AFTER DEBUG
> 
>   PID  TT  STAT      TIME COMMAND
> 80298  ??  Ss     0:17.69 sendmail: accepting connections (sendmail)
> 80299  ??  S      0:46.66 sendmail: running queue: /var/spool/mqueue/qd1/df 
> (sendmail)
> 80300  ??  S      0:08.85 sendmail: running queue: /var/spool/mqueue/mby/df 
> (sendmail)
> 80301  ??  S      0:31.60 sendmail: running queue: /var/spool/mqueue/oz/df 
> (sendmail)
> 80302  ??  S      0:30.73 sendmail: running queue: /var/spool/mqueue/rw2/df 
> (sendmail)
> 80303  ??  S      0:33.32 sendmail: running queue: /var/spool/mqueue/hold/df 
> (sendmail)
> 80304  ??  S      0:30.58 sendmail: running queue: /var/spool/mqueue/pgp/df 
> (sendmail)
> 
> SENDMAIL DETAILS
> 
> Version 8.14.4
>  Compiled with: DNSMAP LOG MAP_REGEX MATCHGECOS MILTER MIME7TO8 MIME8TO7
>               NAMED_BIND NETINET NETUNIX NEWDB NIS PIPELINING SASLv2 SCANF
>               STARTTLS USERDB XDEBUG
> 
> /usr/sbin/sendmail:
>       libsasl2.so.2 => /usr/local/lib/libsasl2.so.2 (0x28154000)
>       libssl.so.7 => /usr/local/lib/libssl.so.7 (0x2816a000)
>       libcrypto.so.7 => /usr/local/lib/libcrypto.so.7 (0x281ad000)
>       libutil.so.8 => /lib/libutil.so.8 (0x282f2000)
>       libc.so.7 => /lib/libc.so.7 (0x28300000)
>       libz.so.5 => /lib/libz.so.5 (0x2840c000)
> 
> I posted about this in comp.mail.sendmail and was told...
> 
> > sleep() should be one of these calls:
> > 
> >         if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME)
> >                 sleep(MIN_SLEEP_TIME);
> >         else if (WorkGrp[wgrp].wg_lowqintvl <= 0)
> >                 sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME);
> >         else
> >                 sleep(WorkGrp[wgrp].wg_lowqintvl);
> > 
> > Unless you have a really large value for one of these, the process
> > should continue after a while.
> 
> The above code snippet is from sendmail/queue.c which fixes
> MIN_SLEEP_TIME at 5.  QueueIntvl defaults to 1.  wg_lowqintvl defaults
> to 0.  I have not set any configuration or runtime options to override
> these defaults, so my persistent queue runners should be sleeping for
> either 1s or 5s only (not hours!).

I think the best way to collect the data would be ktrace the queue runners,
preferrably starting the ktrace before they are stuck.

Attachment: pgpVpk6m6YyQs.pgp
Description: PGP signature

Reply via email to