Hello,

I'm enhancing a program I use to follow Postfix logs in sort of real-time. It 
does a couple of things, and the most important is to record a single-line 
summaries for each message, storing source IP address, from, to, timestamp, 
size, final delivery destination (mail relay or mailbox) and status. In order 
to keep up with a lot of activity, I need to keep some data in memory, so that 
I can summarize all the log lines describing a message process.

The problem is: many things can go wrong in a message delivery, and, worse, 
syslog, based on UDP, can lose datagrams due to different causes. So such a 
program running for a long time could end up storing a lot of incomplete 
messages in memory. I am planning to use timestamps in order to flush 
longstanding messages, but of course I don't want to discard too much 
incomplete data.

Looking at the progress of an email message, I think I've found a possible 
problem (for me) with logging. Let's see a normal example:

1- Client connects ->  smtpd records a "connect from..."

2- client sends valid "HELO"

3- client sends valid "MAIL FROM:"

4- client sends valid "RCPT TO:" -> smtpd logs a "client=" line with a queue id

(at this point, I store the source IP address and queue id for the message)

5- client sends "DATA" and message headers, body.

6- client finishes sending body, sends "." -> cleanup logs "message-id="

7- qmgr logs "from="

8- transport (local, etc) logs 'to=' with delivery status, etc.

9- qmgr logs "queue id removed"

10- smtpd logs "disconnect"



So far, so good, this one is trivial to follow. But, what happens if the client 
isn't so predictable? Let's see an example:

Zaphod:~ borjam$ telnet 172.16.2.117 smtp
Trying 172.16.2.117...
Connected to 172.16.2.117.
Escape character is '^]'.
220 pruebazfs1.sarenet.es ESMTP Postfix
HELO testing         
250 pruebazfs1.sarenet.es
MAIL FROM: <bor...@sarenet.es>
250 2.1.0 Ok
RCPT TO: <bor...@pruebazfs1>
250 2.1.5 Ok
quit
221 2.0.0 Bye
Connection closed by foreign host.



And the log records this:

Feb 23 15:08:20 pruebazfs1 postfix/smtpd[4800]: connect from 
unknown[172.16.1.204]
Feb 23 15:08:56 pruebazfs1 postfix/smtpd[4800]: 522243F471: 
client=unknown[172.16.1.204]
Feb 23 15:09:00 pruebazfs1 postfix/smtpd[4800]: disconnect from 
unknown[172.16.1.204]

After this, I have recorded a (queue id, ip address) tuple, and the client 
disconnected. So it could be a good bet to just remove the aborted 
(queue_id,ip_addr) tuple from my list.  But I'm worried that I could lose 
messages if the server is slow. The "message-id=" line is logged by a different 
program, cleanup.

Doing a pipelined send, I've seen that the progress is logged as follows:

Feb 23 15:13:33 pruebazfs1 postfix/smtpd[4819]: connect from 
unknown[172.16.1.204]
Feb 23 15:13:33 pruebazfs1 postfix/smtpd[4819]: improper command pipelining 
after HELO from unknown[172.16.1.204]
Feb 23 15:13:33 pruebazfs1 postfix/smtpd[4819]: 4DEAD418A8: 
client=unknown[172.16.1.204]
+++Feb 23 15:13:33 pruebazfs1 postfix/cleanup[4823]: 4DEAD418A8: 
message-id=<20100223141333.4dead41...@pruebazfs1.sarenet.es>
Feb 23 15:13:33 pruebazfs1 postfix/qmgr[4746]: 4DEAD418A8: 
from=<bor...@sarenet.es>, size=358, nrcpt=1 (queue active)
+++Feb 23 15:13:33 pruebazfs1 postfix/smtpd[4819]: disconnect from 
unknown[172.16.1.204]

I'm marking the two interesting lines with a "+++". 

I could just discard a (queue_id, ip_addr) tuple in case I see a "disconnect" 
line from the same smtpd process. But, is it possible for the "message-id=" 
line from "cleanup" to be delayed?

I've noticed that smtpd, upon reception of a valid "rcpt to", connects to a 
cleanup process and there seems to be a strong coupling between the two.

Is there, maybe, a guarantee (due to the program flow in smtpd) that the 
"message-id=" line from cleanup will always be logged before the "disconnected" 
line from smtpd? Is it a synchronous coupling, or messages sent from smtpd go 
to a cleanup queue to be processed asynchronously, and hence it would be 
possible to see a "disconnected" smtpd line before the "message-id=" line from 
cleanup?


I hope this message is understandable ;) I need to avoid memory leaks, but I 
don't want to be over aggressive with discards. Mail servers can be painfully 
slow now and then, under heavy loads... 

By the way, wouldn't it be better to have a message similar to the "queue_id 
removed" logged by qmgr, when this happens? I mean, a queue id has been 
allocated  by the smtpd/cleanup pair, and in case the client disconnects before 
"DATA" there's no other mention of that queue id on the system log.



Thank you very much in advance,







Borja.

Reply via email to