On Wed, 2004-04-07 at 20:26, WC -Sx- Jones wrote:
> Traeder, Philipp wrote:
> >>-----Original Message-----
> 
> Should we take this back to the perl list?  After all we
> are talking about writing Perl.

Yes, of course.
If it gets too module-specific, we can still switch to private
mail.


> > 
> > Therefore, I've got a problem: Using a more general approach, I would not be
> > allowed to apply my "from" filter, but I would have to parse the complete
> > file before filtering. But this would be much less performant than what I'm
> > currently doing.
> > The question is: Is it possible to write a performant module that can be
> > used in different scenarios? Or is this approach too theoretical?
> > If somebody wants to parse a 2GB logfile, he should be prepared to wait some
> > time, shouldn't he?
> 
> 
> At any rate, if you have a 2GB log file I would seriously
> consider only grep'ing thru it to reduce the initial
> processing impact.

That was my thought as well, though "real" grepping (as in calling
grep) won't work here because multiple lines make up one record.
Anyway, I'd like to filter the input as early as possible.

> 
> Also, if you have bounces they prolly look like this:
> 
> Apr  6 21:28:34 chasecreek postfix/smtpd[8394]: [ID 197553 mail.info] 
> 40FA61375E: client=ns.suse.de[195.135.220.2]
> Apr  6 21:28:34 chasecreek postfix/cleanup[8395]: [ID 197553 mail.info] 
> 40FA61375E: message-id=<[EMAIL PROTECTED]>
> Apr  6 21:28:34 chasecreek postfix/qmgr[4787]: [ID 197553 mail.info] 
> 40FA61375E: from=<>, size=3724, nrcpt=1 (queue active)
> Apr  6 21:28:34 chasecreek postfix/local[8396]: [ID 197553 mail.info] 
> 40FA61375E: to=<[EMAIL PROTECTED]>, relay=local, delay=0, status=sent 
> (delivered to mailbox)
> Apr  6 21:28:34 chasecreek postfix/qmgr[4787]: [ID 197553 mail.info] 
> 40FA61375E: removed

My logs look a bit different - something like this:
Jan 31 23:34:11 ns1 sm-mta[10966]: i0VMYAG5010966:
from=<[EMAIL PROTECTED]>, [..]
msgid=<[EMAIL PROTECTED]>,
proto=ESMTP, daemon=MTA, relay=[relay_ip_address]
[..]
Jan 31 23:34:12 ns1 sm-mta[10973]: i0VMYAG5010966:
to=<[EMAIL PROTECTED]>, delay=00:00:02,
xdelay=00:00:01, mailer=esmtp, pri=30388, relay=mx01.somewhere.com.
[relay_ip_address], dsn=2.0.0, stat=Sent (OK id=1An3gr-00043m-00)


> So, unless your logging software is clairvoyant I do not see how
> you will know the "original" from unless you have kept ALL of
> the original bounced messages...

In this scenario, I'm particularly interested in the entries starting
with "to" or "from". I'm assuming that the "from" entry comes as the
first "relevant" record for my purposes. Therefore I go through the log
line by line and check whether it's a "from" record from one of the
addresses I'm interested in. If so, I write the complete entry into an
output file and keep the (queue) ID in memory.
For all other (non-"from") records I just check whether the ID is one of
those I'm interested in and, if it is, write the record into the same
output file.
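
To make the single-pass idea above concrete, here is a rough sketch. It
is written in Python only to keep it short; the same structure should
carry over to a Perl while (<>) loop. It is not the actual script. The
names filter_log, LINE_RE and wanted_senders are made up for
illustration, and it assumes each log entry sits on one physical line of
the form "... sm-mta[pid]: QUEUEID: from=<addr>, ...", which may not
match every sendmail configuration.

```python
import re

# Assumed line shape: "<timestamp> <host> sm-mta[<pid>]: <queue id>: <rest>"
LINE_RE = re.compile(r"sm-mta\[\d+\]: (\w+): (.*)")


def filter_log(lines, wanted_senders):
    """Yield the log lines of every message sent by one of wanted_senders.

    Works in one pass: a matching "from" record registers its queue ID,
    and every later record with a registered queue ID is passed through.
    """
    wanted_ids = set()  # queue IDs of messages we decided to keep
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a line in the assumed format
        queue_id, rest = match.groups()
        if rest.startswith("from=<"):
            # Extract the address between "from=<" and the next ">".
            sender = rest[len("from=<"):].split(">", 1)[0]
            if sender in wanted_senders:
                wanted_ids.add(queue_id)
                yield line
        elif queue_id in wanted_ids:
            yield line
```

Since filter_log only iterates over its input, it can be handed an open
file handle, so even a 2GB log is never held in memory as a whole; only
the set of interesting queue IDs grows.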

I hope that this will allow me to parse big log files in a reasonable
amount of time, but I'm afraid this particular approach won't work for a
module. That's no big problem in itself: I could finish my script, and
we could independently start writing the general module for parsing
sendmail log files. I'm just wondering whether this scenario of big log
files that need to be filtered one way or another isn't a fairly common
one.

Just my 5 cents,

Philipp

