> On Feb 16, 2022, at 1:28 PM, Jeremy Harris via Exim-users 
> <[email protected]> wrote:
> 
> On 16/02/2022 20:42, Michael Tratz via Exim-users wrote:
>> I usually can find a few messages per day on a server so if you need me to 
>> test a patch, please let me know.
> 
> Useful to know.  I assume that since you're running
> with those patches, the periodic daemon restart wworkaround
> has no effect?   How long has the daemon been up for?

No restarting the daemon doesn’t do anything. The only solution which works is 
to kill all exim processes and then restart the daemon.
exim will run through the queue and try again. The same messages which caused 
the delivery process to be stuck will again cause a stuck process. Once the 
retry rules have been exhausted (5 days) the message will be removed from the 
queue after a new restart and killing all exim processes. Otherwise those 
messages will just stick around. I had some which were in the queue for 20+ 
days until I noticed that issue.

This happens for messages which get a 4xx or 5xx error.

For messages which succeed with a 2xx code they stay stuck too. The exim mail 
queue output shows a D for delivered in the mail queue. But it never logs that 
the message was sent. Once you kill the stuck process, and use for example exim 
-d+all -M message_id the message will be removed from the mail queue. Again as 
soon as you use hosts_avoid_tls = * all these problem are gone. Here are some 
details for such a message with a 2xx code.

exiwhat | egrep -i 1nKPol-000KBV-Au
77630 delivering 1nKPol-000KBV-Au: waiting for a remote delivery subprocess to 
finish
77631 delivering 1nKPol-000KBV-Au to smtp.secureserver.net [68.178.213.37] 
([email protected])

ps output:

exim           77630   0.0  0.0  20620   7760  -  I    11:18        0:00.03 
/usr/local/sbin/exim -Mc 1nKPol-000KBV-Au
exim           77631   0.0  0.1  20620   8588  -  I    11:18        0:00.02 
/usr/local/sbin/exim -Mc 1nKPol-000KBV-Au

I tried truss to trace the system calls of both processes. 77631 is not 
printing anything. 77630 periodically writes:

truss -p 77630
wait4(-1,{ STOPPED,sig=127 },WNOHANG,0x0)        = 0 (0x0)
poll({ 24/POLLIN -1/POLLHUP|POLLNVAL|POLLINIGNEOF|0xe00 -1/0x0 -1/0x0 -1/0x0 
-1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 -1/0x0 
},15,60000) = 0 (0x0)
wait4(-1,{ STOPPED,sig=127 },WNOHANG,0x0)        = 0 (0x0)

mailq | grep -1 1nKPol-000KBV-Au

 4h   14K 1nKPol-000KBV-Au <[email protected]>
        D [email protected] <mailto:[email protected]>

The crosslink.net <http://crosslink.net/> is also the only recipient of that 
message. So the message should be removed from the queue after it was 
successfully delivered. Then killing process 77630 and 77631

exim d+all -M 1nKPol-000KBV-Au (I removed the beginning of that output)

15:20:04 80137 set_process_info: 80137 delivering specified messages
15:20:04 80137 set_process_info: 80137 delivering 1nKPol-000KBV-Au
15:20:04 80137 Trying spool file /var/spool/exim//input//1nKPol-000KBV-Au-D
15:20:04 80137 reading spool file 1nKPol-000KBV-Au-H
15:20:04 80137 user=exim uid=60 gid=6 [email protected]
15:20:04 80137 sender_fullhost = ([192.168.1.2]) [68.227.83.85]:54443
15:20:04 80137 sender_rcvhost = [68.227.83.85] (port=54443 helo=[192.168.1.2])
15:20:04 80137 sender_local=0 ident=unset
15:20:04 80137 Non-recipients:
15:20:04 80137  Empty Tree
15:20:04 80137 ---- End of tree ----
15:20:04 80137 recipients_count=1
15:20:04 80137 **** SPOOL_IN - No additional fields
15:20:04 80137 body_linecount=345 message_linecount=22
15:20:04 80137 Previously delivered address [email protected] taken from 
journal file
15:20:04 80137 Writing spool header file: 
/var/spool/exim//input//hdr.1nKPol-000KBV-Au
15:20:04 80137 DSN: **** SPOOL_OUT - address: <[email protected]> errorsto: 
<NULL> orcpt: <NULL> dsn_flags: 0x0
15:20:04 80137 Renaming spool header file: 
/var/spool/exim//input//1nKPol-000KBV-Au-H
15:20:04 80137 Size of headers = 1043
15:20:04 80137 Delivery address list:
15:20:04 80137 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
15:20:04 80137 After routing:
15:20:04 80137   Local deliveries:
15:20:04 80137   Remote deliveries:
15:20:04 80137   Failed addresses:
15:20:04 80137   Deferred addresses:
15:20:04 80137 search_tidyup called
15:20:04 80137 >>>>>>>>>>>>>>>> deliveries are done >>>>>>>>>>>>>>>>
15:20:04 80137 changed uid/gid: post-delivery tidying
15:20:04 80137   uid=60 gid=6 pid=80137
15:20:04 80137   auxiliary group list: 6
15:20:04 80137 set_process_info: 80137 tidying up after delivering 
1nKPol-000KBV-Au
15:20:04 80137 Processing retry items
15:20:04 80137 Succeeded addresses:
15:20:04 80137 Failed addresses:
15:20:04 80137 Deferred addresses:
15:20:04 80137 end of retry processing
15:20:04 80137 LOG: MAIN
15:20:04 80137   Completed QT=4h1m41s
15:20:04 80137 end delivery of 1nKPol-000KBV-Au
15:20:04 80137 search_tidyup called
15:20:04 80137 search_tidyup called
15:20:04 80137 >>>>>>>>>>>>>>>> Exim pid=80137 (fresh-exec) terminating with 
rc=0 >>>>>>>>>>>>>>>>
                           

The log entry looks like this:

2022-02-16 11:18:23 1nKPol-000KBV-Au <= [email protected] H=([192.168.1.2]) 
[68.227.83.85]:54443 P=esmtpsa X=TLS1.2:ECDHE-RSA-AES256-GCM-SHA384:256 CV=no 
A=cram:[email protected] S=14523 
[email protected] from <[email protected]> 
for [email protected]
2022-02-16 11:31:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 11:46:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 12:01:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 12:16:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 12:31:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 12:46:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 13:01:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 13:16:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 13:31:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 13:46:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 14:01:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 14:16:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 14:31:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 14:46:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 15:01:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 15:16:00 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 15:19:44 1nKPol-000KBV-Au Spool file for 1nKPol-000KBV-Au is locked 
(another process is handling this message)
2022-02-16 15:20:04 1nKPol-000KBV-Au Completed QT=4h1m41s

But this issue doesn’t happen with every message which gets a 2xx code. Other 
messages get delivered just fine and log the missing delivery output  and are 
removed from the queue after delivery is complete.

I have not received any complaints of messages not getting delivered. Even 
though there is no delivery log entry the messages seems to get delivered 
correctly to the recipient.

> 
> If you see any indication of crashing processes, a stacktrace
> from a coredump would be of help.


The processes just sit around and do nothing. No crash. I have seen some 
processes hanging around since Saturday trying to deliver the message and they 
would stay that way if I don’t kill the process (daemon up since Saturday)

Thanks,

Michael

-- 
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/

Reply via email to