I'm looking to write a utility to do some processing on email messages
stored in mbox format.  Some mbox files can be quite large, hundreds of
megs or perhaps gigs in size.  Obviously, reading in the whole file at
once isn't feasible.  The most obvious method is to set $/ to the
string "\n\nFrom " ($/ is matched literally, not as a regex; messages in
mbox format are separated by a blank line and begin with a From line)
and to read in email messages one at a time.
It seems to me that this would be quite slow.  Another possibility that
springs to mind is to read in chunks of 64k or so of data and then
split those chunks into individual messages.  This will complicate the
program logic, however, as the chunks will inevitably split the last
message in two.  I'd then either have to back up the offset into the
file to point to the beginning of the message or to store the beginning of
the message, read in a new chunk, get the last half of the message off
the new chunk, combine it with the stored beginning of the message, then
process it.
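
To make the two approaches concrete, here is a minimal sketch of what I
have in mind.  The sub names each_message_by_separator and
each_message_by_chunks are hypothetical, as is the callback interface,
and the code assumes that the string "\n\nFrom " only ever appears at a
message boundary, which may not hold for mbox variants that leave body
lines starting with "From " unquoted.  I have not benchmarked either
version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Approach 1: let Perl split the records by setting $/ to a literal string.
# Each message is passed to $callback, ending with the newline of its
# last line; the blank separator line is dropped.
sub each_message_by_separator {
    my ($path, $callback) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = "\n\nFrom ";
    my $first = 1;
    while (defined(my $msg = <$fh>)) {
        # chomp strips a trailing $/ and returns the number of characters
        # removed; put back the newline that ended the last line.
        $msg .= "\n" if chomp($msg);
        # Every record after the first lost its leading "From " to $/.
        $msg = "From $msg" unless $first;
        $first = 0;
        $callback->($msg);
    }
    close $fh;
}

# Approach 2: read fixed-size chunks and carry the incomplete tail of the
# buffer over to the next read instead of seeking backwards.
sub each_message_by_chunks {
    my ($path, $callback, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;    # assumed default of 64k
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buffer = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $buffer .= $chunk;
        # Emit every complete message now in the buffer.
        while ((my $pos = index($buffer, "\n\nFrom ")) >= 0) {
            $callback->(substr($buffer, 0, $pos + 1));
            # Drop the message and the blank line; "From " stays in place.
            substr($buffer, 0, $pos + 2, '');
        }
    }
    # Whatever is left at end of file is the final message.
    $callback->($buffer) if length $buffer;
    close $fh;
}

# Usage: perl thisscript.pl /path/to/mbox
if (@ARGV) {
    my $count = 0;
    each_message_by_chunks($ARGV[0], sub { $count++ });
    print "$count messages\n";
}
```

In both versions only one message (plus at most one chunk) should be in
memory at a time.  The chunked version rescans the carried-over tail on
every read, so a single very large message would be searched repeatedly;
remembering the last search offset would presumably avoid that.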

I'm aware that there are a number of modules which deal with mail and
mbox handling, but so far none of them seem to make doing what I'm
trying to do easy.  Reinventing the wheel isn't always a waste of time -
it's sometimes a very good way to learn how wheels are constructed and
how to use your tools to construct wheels.  This gives you insight and
practice when you have to use those same tools to construct
non-wheels. :)

Any thoughts or pointers to discussions on how to handle large files in
Perl would be welcome.



