On Tue, 26 Apr 2011 15:39:37 -0400, Brandon McGinty wrote:
> I'm trying to import hundreds of thousands of e-mail messages into a
> database with Python.
> However, some of these mailboxes are so large that they are giving
> errors when being read with the standard mailbox module.
> I created a buffered reader that reads chunks of the mailbox, splits
> them using re.split with a compiled regexp, and imports each chunk as
> a message.
> The regular expression work is where the bottleneck appears to be,
> based on timings.
> I'm wondering if there is a faster way to do this, or some other
> method that you all would recommend.
Consider using awk. In my experience, high-level languages tend to have slower regex libraries than simple tools such as sed and awk. E.g. the following script reads a mailbox on stdin and writes a separate file for each message:

#!/usr/bin/awk -f
BEGIN {
    num = 0;
    ofile = "";
}
/^From / {
    if (ofile != "")
        close(ofile);
    ofile = sprintf("%06d.mbox", num);
    num++;
}
{
    print > ofile;
}

It would be simple to modify it to start a new file after a given number of messages or a given number of lines. You can then read the resulting smaller mailboxes using your Python script.

-- 
http://mail.python.org/mailman/listinfo/python-list