Hello again,
>Notice 63 unpreprocessed messages and 63 bytes in trigger? Not a
>coinicidence. qmail-send isn't reading trigger. Is qmail-send still
>running? If so, strace it. What's it doing?
Here are the results of my strace. Let me preempt this with a strong
apology: Aside from cutting it short fro brevity's sake, I have munged the
specific domain data. I _know_ this is bad juju, and could not agree with
everyone more that real data should be used, but my employer has expressly
forbidden me from displaying it. And hopefully, it isn't relevant in
tackling this particular problem(if I ever have a mystery on my personal
servers, you'd better believe I'll give you every detail, including my dog's
name ;-) But, I digress; the strace:
[root@mail2 bin]# /usr/bin/strace /var/qmail/bin/qmail-send
execve("/var/qmail/bin/qmail-send", ["/var/qmail/bin/qmail-send"], [/* 23
vars */]) = 0
_sysctl({{CTL_KERN, KERN_OSRELEASE}, 2, "2.2.16-22smp", 12, NULL, 0}) = 0
brk(0) = 0x8056288
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
= 0x40017000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or
directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, 0xbffff35c) = -1 ENOSYS (Function not
implemented)
fstat(3, {st_mode=S_IFREG|0644, st_size=13562, ...}) = 0
old_mmap(NULL, 13562, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40018000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=4776568, ...}) = 0
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220\274"..., 4096)
= 4096
old_mmap(NULL, 1196776, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4001c000
mprotect(0x40137000, 37608, PROT_NONE) = 0
old_mmap(0x40137000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3,
0x11a000) = 0x40137000
old_mmap(0x4013d000, 13032, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4013d000
close(3) = 0
munmap(0x40018000, 13562) = 0
getpid() = 8198
chdir("/var/qmail") = 0
open("control/me", O_RDONLY|O_NONBLOCK) = 3
read(3, "mail2.mycompanydomain.com\n", 64) = 18 #####Here is a line I
changed#####
close(3) = 0
open("control/queuelifetime", O_RDONLY|O_NONBLOCK) = -1 ENOENT (No such file
or directory)
open("control/concurrencylocal", O_RDONLY|O_NONBLOCK) = 3
read(3, "120\n", 64) = 4
close(3) = 0
open("control/concurrencyremote", O_RDONLY|O_NONBLOCK) = 3
read(3, "120\n", 64) = 4
close(3) = 0
open("control/envnoathost", O_RDONLY|O_NONBLOCK) = -1 ENOENT (No such file
or directory)
open("control/bouncefrom", O_RDONLY|O_NONBLOCK) = -1 ENOENT (No such file or
directory)
open("control/bouncehost", O_RDONLY|O_NONBLOCK) = -1 ENOENT (No such file or
directory)
open("control/doublebouncehost", O_RDONLY|O_NONBLOCK) = -1 ENOENT (No such
file or directory)
open("control/doublebounceto", O_RDONLY|O_NONBLOCK) = -1 ENOENT (No such
file or directory)
open("control/locals", O_RDONLY|O_NONBLOCK) = 3
read(3, "localhost\nmail2.mycompanydomain.com\n", 64) = 28 #####Here is a
line I changed#####
read(3, "", 64) = 0
close(3) = 0
open("control/percenthack", O_RDONLY|O_NONBLOCK) = -1 ENOENT (No such file
or directory)
open("control/virtualdomains", O_RDONLY|O_NONBLOCK) = 3
[...]
followed by approx 1200 lines, similar to the ones in this excerpt(domain
names changed):
[...]
read(3, "fobwear-net\nkrut.com:krut-co"..., 64) = 64
read(3, "practicalvision.com:practical"..., 64) = 64
brk(0x806b000) = 0x806b000
[...]
and finally ending with:
[...]
read(3, "", 64) = 0
close(3) = 0
chdir("queue") = 0
rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x8048b8c, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGALRM, {0x8048b9c, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGHUP, {0x8048bac, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, NULL, 8) = 0
umask(077) = 022
open("lock/sendmutex", O_WRONLY|O_NONBLOCK) = 3
flock(3, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily
unavailable)
write(0, "alert: cannot start: qmail-send "..., 51alert: cannot start:
qmail-send is already running
) = 51
_exit(111) = ?
----------------------------------------------------------------------------
----
First off, I note the "(Resource temporarily unavailable)" just after
lock/sendmutex, so here's an ls -l of my /var/qmail/queue/lock directory in
case I've missed some other permission type issue:
-rw------- 1 qmails qmail 0 Jul 28 01:47 sendmutex
-rw-r--r-- 1 qmailr qmail 1024 Aug 6 13:39 tcpto
prw--w--w- 1 qmails qmail 0 Aug 6 13:41 trigger
In addion, it's that last line re: "qmail-send already running" that'll be
damning me, I think =) I was stopping and restarting qmail via the LWQ qmail
stop and qmail start commands, but I'm not the only one who has been on the
server, it looks like it was stopped/started some other way unclean. It
could still be me, though, because when I've done qmail stop and then qmail
start, I've seen lines such as this appear:
supervise: fatal: unable to acquire qmail-send/supervise/lock: temporary
failure
However, I just made another observation. I tried rebooting the server to
let it come up "clean" and insure all else is equal, and an strace of
qmail-send afterwards still leaves me with the following at the end of the
strace:
chdir("queue") = 0
rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x8048b8c, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGALRM, {0x8048b9c, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGHUP, {0x8048bac, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, NULL, 8) = 0
umask(077) = 022
open("lock/sendmutex", O_WRONLY|O_NONBLOCK) = 3
flock(3, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily
unavailable)
write(0, "alert: cannot start: qmail-send "..., 51alert: cannot start:
qmail-send is already running
) = 51
_exit(111) = ?
So, assuming I'm diagnosing this correctly as per your suggestion, and to
clear one other point of ignorance on my part, what is the safe/clean way to
stop and restart qmail if that ever needs to be done? qmail stop, wait for a
certain amount of time, qmail start? Some other process? I'd be interested
to know so I can avoid causing this problem in the future. If you can show
me the light, I'll scour my scripts for any possibly guilty code. I do have
an automated service that checks a database to add/remove email accounts
automatically, and offhand I do know that this entails me calling
"bin/qmail-newu" and LWQ "qmail restart" frequently, as well as
adding/removing from the other key files, such as the rcpthosts. Any chance
that running either of those too quickly could force it into the state shown
in the strace? The intention of the qmail restart is to have qmail read the
updated rcpthosts and such. Aside from possibly calling it too frequently,
is a "qmail restart" the correct method of having qmail re-read the
rcpthosts file?
Thanks again, BTW. Hopefully I've provided enough info that you can show me
what steps I need to take from here.
--Matt Hubbard
-----Original Message-----
From: Dave Sill [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 06, 2001 12:48 PM
To: [EMAIL PROTECTED]
Subject: Re: Sporadic preprocessed queue backlog
"Matt Hubbard" <[EMAIL PROTECTED]> wrote:
>Fairly frequently throughout an average day, my preprocessed queue will
>begin to grow steadily and not get processed. In most cases, if this is
>ignored, it resumes processing eventually. Sometimes after 15 or so
minutes,
>sometimes after a couple of hours, but at bad times, it can fail to clear
>out the preprocessed queue for days. I've checked the logs, and in no case
>is the concurrency peaked during this problem(in fact, local is usually low
>at 1/120 and remote usually at about 20 to 40/120), though I'm not sure if
>that would be related, anyway.
Strange.
>The first thing I checked, of course, is the /var/qmail/queue/lock/trigger
>file, as noted in the archives. As far as I can tell, it looks correct.
That would ahve been my first suggestion.
>Here is an example of my problem at 11:14am:
>
>qmail-qstat output:
>messages in queue: 228
>messages in queue but not yet preprocessed: 63
>
>trigger file at the time:
>prw--w--w- 1 qmails qmail 63 Aug 6 11:14 trigger
Notice 63 unpreprocessed messages and 63 bytes in trigger? Not a
coinicidence. qmail-send isn't reading trigger. Is qmail-send still
running? If so, strace it. What's it doing?
>The only piece I note is that trigger has a file size of 63 before and 0
>afterwards. Is it normal for this pipe to increase/decrease in size, or is
>that normal behaviour for a pipe?
That's normal pipe behaviour, but it's not normal for qmail-send to
not read bytes soon after they're written.
-Dave