I see Mark answered some of the same questions already, but it would
be really painstaking to avoid duplication (it took more than an hour
to write this :-), so I'm just gonna scan it quickly and then send.

Stephan Krinetzki writes:
 > My queues:

(Omitting the empty ones.)

 > /opt/mailman/var/queue/out:
 > total 2868
 > drwxrwx---  2 mailman mailman    4096 Jul 31 13:26 .
 > drwxr-xr-x 14 mailman mailman     165 Jun 27  2024 ..
 > -rw-rw----  1 mailman mailman  221708 Jul 30 13:16 1753874180.0845337+0cc7849043859a79dc3678a0d8b63c1c66df0c66.pck.tmp
 > -rw-rw----  1 mailman mailman  733425 Jul 31 00:00 1753912841.8847518+da55f8789ae41b75d19a02b595e3fd6d45983ade.pck.tmp
 > -rw-rw----  1 mailman mailman   17758 Jul 31 13:26 1753961185.0481033+6adf966467266567275f1146f5054c95e4365c13.pck

Those two .pck.tmp files are bad news.  They indicate that Mailman was
trying to do something with those messages, and the process was
interrupted.  You should check whether there was a Mailman restart at
those times (although it should shut down gracefully and not leave .tmp
files behind), or whether a runner crashed.

The .tmp files *may* be deliverable, but you'd have to look at them to
be sure that they are complete.  It's possible that they have been
delivered already and the .tmp files just need to be removed.  You can
look at them with "mailman qfile" as usual; "qfile" doesn't check the
filename extension.  If they haven't been delivered and a careful check
shows they're intact, renaming them to drop the .tmp suffix will put
them at the head of the queue.
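
To find leftover .tmp files across all the queues in one pass, something
like the following sketch may help.  It only looks at file names and
modification times and never opens or renames anything.  The function
name find_tmp_files and the 5-minute threshold are made up for this
example, and the queue root is taken from the listing above.

```python
from pathlib import Path
import time


def find_tmp_files(queue_root, min_age_seconds=300.0, now=None):
    """Return (path, age_in_seconds) for each *.tmp file in any queue
    directory directly under queue_root that is at least
    min_age_seconds old, judged by its modification time."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(queue_root).glob("*/*.tmp")):
        age = now - path.stat().st_mtime
        if age >= min_age_seconds:
            stale.append((path, age))
    return stale
```

Calling find_tmp_files("/opt/mailman/var/queue") as the mailman user
would list candidates to inspect with "mailman qfile".  Keep in mind that
the modification time is only a hint, since other tools may touch the
files.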

It's also odd that the .pck above precedes the .bak below (unless you
have multiple slices for the out queue?).

The rest of the queue looks normal, except that it seems rather long.
I only see queues that long when the outgoing MTA is borked.
(Although my experience with high-traffic systems is restricted to
helping a couple of folks for whom there was zero cost to adding CPUs
and memory to their VMs, I feel better that Mark picked up on this
too.)  You might want to reconfigure Mailman to use more out slices,
but that depends on what else your MTA should be using its bandwidth
for.  Because of the way the slicing algorithm works, the number of
slices needs to be a power of 2, so each step up doubles the number of
simultaneous connections Mailman makes to the MTA.  (I don't think
there's any point to more than 4 slices unless you're doing more than
one incoming post/second.)
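
To make the slicing remark concrete, here is a minimal sketch.  It
assumes that the 40 hex digits after the "+" in a queue file name are a
SHA-1 digest and that each slice owns an equal, contiguous range of that
160-bit space.  slice_bounds and slice_for are names invented for this
illustration, not Mailman functions, and Mailman's actual code may
differ in detail.

```python
HASH_BITS = 160  # size of a SHA-1 digest (40 hex digits)


def _check(num_slices):
    # A power of 2 has exactly one bit set, so n & (n - 1) is zero.
    if num_slices < 1 or num_slices & (num_slices - 1):
        raise ValueError("num_slices must be a power of 2")


def slice_bounds(slice_index, num_slices):
    """Return the inclusive (lower, upper) hash range owned by a slice."""
    _check(num_slices)
    width = (1 << HASH_BITS) // num_slices
    lower = slice_index * width
    return lower, lower + width - 1


def slice_for(filename, num_slices):
    """Return the index of the slice whose range contains the digest
    embedded in a name like '<timestamp>+<hexdigest>.pck'."""
    _check(num_slices)
    digest = filename.split("+", 1)[1].split(".", 1)[0]
    return int(digest, 16) // ((1 << HASH_BITS) // num_slices)
```

Under these assumptions, with 4 slices the .pck whose digest starts with
6 would land in slice 1 and the .bak whose digest starts with b would
land in slice 2, so the two files would belong to different runners.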

 > -rw-rw----  1 mailman mailman   31122 Jul 31 13:26 1753961185.1064982+b1f502e2af56b9b11680135c6de5fcc5285d967e.bak

This .bak file is currently being processed by Mailman; that's normal.
The rest of the .pck files are also normal, just waiting.  (Omitting
the rest of the out queue listing.)

 > /opt/mailman/var/queue/shunt:
 > total 3304
 > drwxrwx---  2 mailman mailman    4096 Jul 31 11:44 .
 > drwxr-xr-x 14 mailman mailman     165 Jun 27  2024 ..
 > -rw-rw----  1 mailman mailman     451 Jul 31 00:00 1753912822.2651796+28eceef7e18eb70393377b88dc7117af8f9362a0.pck
 > -rw-rw----  1 mailman mailman     490 Jul 31 00:00 1753912838.4197352+ea531cf0262c1faa58b1679b907fee92bc16822c.pck
 > -rw-rw----  1 mailman mailman 1407870 Jul 31 00:00 1753912841.9177196+ccea15bdefce3a54301281c8eddf86e8230244a6.pck
 > -rw-rw----  1 mailman mailman   86108 Jul 31 00:00 1753912841.9197443+7dcef4febc71e44c6d9309a24a08b08753e1ff42.pck
 > -rw-rw----  1 mailman mailman 1407668 Jul 31 00:00 1753912841.9849963+a3f1869b750060c97262ece38737480d91652828.pck
 > -rw-rw----  1 mailman mailman   38992 Jul 31 00:01 1753912860.5167956+940584c4f361cbd8c29e390b2f60590558effe40.pck
 > -rw-rw----  1 mailman mailman     440 Jul 31 00:01 1753912860.6972685+635c065bac8dff5f9d562275d707001d773b84c1.pck
 > -rw-rw----  1 mailman mailman     445 Jul 31 00:01 1753912868.7903054+befa066f254d7a3529a8555a6c942a554715d837.pck
 > -rw-rw----  1 mailman mailman   33494 Jul 31 00:01 1753912878.7562895+82b724fd93260ab9a2bb49709d3a42a2f32f2c80.pck
 > -rw-rw----  1 mailman mailman  217073 Jul 31 00:01 1753912878.9337828+de73a65d9c6febfa80275853921f4b53fd1d9e2a.pck
 > -rw-rw----  1 mailman mailman   85888 Jul 31 00:02 1753912950.303359+8a87bce0be63ac1df8493c6b1ad6ae154fcedba7.pck
 > -rw-rw----  1 mailman mailman   50244 Jul 31 00:02 1753912950.4970112+d44d493912bb3024547b8a5112f86f035dcb352f.pck
 > -rw-rw----  1 mailman mailman   12887 Jul 31 00:02 1753912970.038427+31bcbd7fb2ebdf81f6de24b7283b50bcda6ded21.pck.tmp
 > -rw-rw----  1 mailman mailman     443 Jul 31 11:44 1753955094.900898+67bc76525412da66a7c76363f65f583989716305.pck
 > 
 > /opt/mailman/var/queue/virgin:
 > total 32
 > drwxrwx---  2 mailman mailman    81 Jul 31 13:17 .
 > drwxr-xr-x 14 mailman mailman   165 Jun 27  2024 ..
 > -rw-rw----  1 mailman mailman 32013 Jan 11  2025 1736550035.6163204+472f81ece5e45a2651a4499bef418f611b43c619.pck.tmp
 > 
 > Nothing special there (shunt should be checked, but not in
 > correlation with my mail).

I tend to disagree, as the first series of shunt files ends with a
.tmp.  There's another one of those .tmp files in virgin, and it's more
than 6 months old.  Hmmm, that one is *also* on the hour.  Do you have
lots of cron jobs that run on the hour, maybe?

You're probably right that there's no correlation, but you can't trust
the dates from ls -l or stat because when running "mailman unshunt"
all of the queue files in shunt will get "touched" if they're not
sent.  (If I recall correctly.)  The fact that a spate of timestamps
occurs right at 00:00 means either there's a cron job running unshunt
then, or you have a spammer or similar sending a bunch of broken mail
to you on the hour.  (I say you're probably right because the
timestamp in the name decodes to the same time, and I don't think that
changes when unshunt is run.)  And again, you have a stale .tmp file
there, which means something bad happened, most likely not under
Mailman's control.
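
For anyone who wants to repeat the decoding, here is a small sketch.  It
assumes the part of the queue file name before the "+" is a Unix
timestamp in seconds, which is consistent with the listings above;
enqueue_time is a name made up for this example.

```python
from datetime import datetime, timezone


def enqueue_time(filename):
    """Return the UTC datetime encoded before the '+' in a queue file
    name such as '1753912841.9177196+<hexdigest>.pck'."""
    # Drop any leading directory, then keep the text before the '+'.
    stamp = filename.rsplit("/", 1)[-1].split("+", 1)[0]
    return datetime.fromtimestamp(float(stamp), tz=timezone.utc)
```

For the 1753912841.9177196+... file in shunt this gives 22:00:41 UTC on
July 30, which is 00:00:41 on July 31 if the server's clock is on UTC+2
(an assumption on my part, but it matches the ls output).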

There is (or maybe was, by now?) a bug in the logging such that logs did
not get properly rotated.  Many sites dealt with this by restarting
Mailman on the same schedule as the log rotation.  Do you do that?
(I'm just fishing, I don't know how it could cause the main issue you
are seeing.)

 > mailq is empty, so my postfix works as expected.

Hm.  At those "high traffic" sites I mentioned, it was the other way
around: with 4 (or 8) "out" slices, the out queue would be clear >80%
of the time (according to "while true; do ls -l $OUTQUEUE; sleep 5;
done", nothing sophisticated).  But the MTA's mail queue would
typically be backlogged by many minutes.  As I said, the Mailman hosts
at those sites were insanely overpowered, so your mileage will vary.
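
If you would rather have counts than eyeball ls output, the same kind of
polling can be sketched in Python.  This is only an illustration:
queue_summary and watch are invented names, and the classification is
done purely on the file name endings seen in the listings above.

```python
import time
from collections import Counter
from pathlib import Path


def queue_summary(queue_dir):
    """Count the files in one queue directory by kind: 'pck' (waiting),
    'bak' (being processed), 'tmp' (leftover) or 'other'."""
    counts = Counter()
    for path in Path(queue_dir).iterdir():
        if not path.is_file():
            continue
        for kind in ("tmp", "bak", "pck"):
            if path.name.endswith("." + kind):
                counts[kind] += 1
                break
        else:
            counts["other"] += 1
    return dict(counts)


def watch(queue_dir, interval=5.0, rounds=None):
    """Print a summary every `interval` seconds; rounds=None runs until
    interrupted."""
    done = 0
    while rounds is None or done < rounds:
        print(time.strftime("%H:%M:%S"), queue_summary(queue_dir))
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Running watch("/opt/mailman/var/queue/out") for a while shows whether the
out queue ever drains; an empty dict means it was clear at that moment.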

I have to think that there is a problem in the handoff between Mailman
and the MTA.  Offhand I have no idea why Mailman is neither preserving
the queue file nor logging a successful delivery to the MTA.  I suspect
the queue runner is crashing, but that doesn't explain why this happens
only to certain lists.

Is Postfix delivering to the final destination itself, or does it pass
on the messages to a smarthost?  Is Mailman talking to the local MTA,
or is it possibly talking to an MTA on a different node?  I did have
to diagnose a problem once where a system was misconfigured, and
Mailman was talking not to the local Postfix but to a Postfix in a
datacenter a megameter or so away!  (That didn't lose any mail, but
the connection would occasionally freeze and not time out, leading to
a huge buildup in the out queue.)  Anyway, if Mailman isn't talking
to an MTA with a <50ms ping time, you could try changing the
configuration so it does.
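
A quick way to get a feel for that latency from the Mailman host is to
time a TCP connection to the MTA.  This is a rough proxy for ping time,
not the same measurement, and the sketch below is only that: the
function name connect_time_ms is made up, and the host and port you
pass should be whatever Mailman is actually configured to deliver to.

```python
import socket
import time


def connect_time_ms(host, port=25, timeout=5.0):
    """Return the time in milliseconds needed to open a TCP connection
    to (host, port).  Raises OSError if the connection fails or times
    out."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0
```

For example, connect_time_ms("localhost", 25) on a host with a local
Postfix should come back in a millisecond or so; tens of milliseconds or
more suggests Mailman is talking to something farther away.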

I don't put much stock in any of the above ideas.  I hope that you or
somebody else comes up with better ones!


-- 
GNU Mailman consultant (installation, migration, customization)
Sirius Open Source    https://www.siriusopensource.com/
Software systems consulting in Europe, North America, and Japan
_______________________________________________
Mailman-users mailing list -- mailman-users@mailman3.org
To unsubscribe send an email to mailman-users-le...@mailman3.org
https://lists.mailman3.org/mailman3/lists/mailman-users.mailman3.org/
Archived at: 
https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/2APBZJR2YZOOUHJKDRBB37K5TDEKZJ2E/