On Tue, 26 Apr 2011 15:39:37 -0400, Brandon McGinty wrote:

> I'm trying to import hundreds of thousands of e-mail messages into a
> database with Python.
> However, some of these mailboxes are so large that they are giving
> errors when being read with the standard mailbox module.
> I created a buffered reader, that reads chunks of the mailbox, splits
> them using the re.split function with a compiled regexp, and imports
> each chunk as a message.
> The regular expression work is where the bottleneck appears to be,
> based on timings.
> I'm wondering if there is a faster way to do this, or some other method
> that you all would recommend.

Consider using awk. In my experience, high-level languages tend to have
slower regex libraries than simple tools such as sed and awk. Just as
importantly, awk processes the input one line at a time, so it never
needs to hold the whole mailbox (or a large chunk of it) in memory.

E.g. the following script reads a mailbox on stdin and writes a separate
file for each message:

        #!/usr/bin/awk -f
        BEGIN {
                num = 0;
                ofile = "";
        }
        
        # Each mbox message starts at a line beginning with "From ";
        # open a new numbered output file at each such line.
        /^From / {
                if (ofile != "") close(ofile);
                ofile = sprintf("%06d.mbox", num);
                num++;
        }
        
        # Copy every line to the current output file (the guard skips
        # any junk before the first "From " line).
        {
                if (ofile != "") print > ofile;
        }

It would be simple to modify it to start a new file after a given number
of messages or a given number of lines.
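If you'd rather stay in Python, the same streaming split works without
regular expressions at all: scan line by line and test for the "From "
separator with str.startswith. A sketch along those lines (the
messages_per_file cap and the filename pattern are illustrative, not
from the original script):

```python
import sys

def split_mbox(stream, messages_per_file=10000):
    """Split an mbox read line by line from `stream` into
    numbered chunk files, `messages_per_file` messages each."""
    out = None
    num = 0      # messages seen so far
    fileno = 0   # output files created so far
    for line in stream:
        if line.startswith("From "):
            if num % messages_per_file == 0:
                if out is not None:
                    out.close()
                out = open("%06d.mbox" % fileno, "w")
                fileno += 1
            num += 1
        if out is not None:  # skip any junk before the first "From "
            out.write(line)
    if out is not None:
        out.close()
```

Call it as, e.g., split_mbox(sys.stdin) or split_mbox(open("big.mbox")).
Because it never slurps more than one line, it sidesteps both the memory
problem and the chunk-boundary headaches of splitting with re.split.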

You can then read the resulting smaller mailboxes using your Python script.
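For example (assuming the chunk files sit in the current directory),
each piece should now be small enough for the standard mailbox module:

```python
import glob
import mailbox

for path in sorted(glob.glob("*.mbox")):
    for msg in mailbox.mbox(path):
        # process each message here, e.g. insert it into the database
        print(msg["Subject"])
```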

-- 
http://mail.python.org/mailman/listinfo/python-list