On Tue, Apr 26, 2011 at 12:39 PM, Brandon McGinty <brandon.mcgi...@gmail.com> wrote:
> List,
> I'm trying to import hundreds of thousands of e-mail messages into a
> database with Python.
> However, some of these mailboxes are so large that they are giving
> errors when being read with the standard mailbox module.
> I created a buffered reader, that reads chunks of the mailbox, splits
> them using the re.split function with a compiled regexp, and imports
> each chunk as a message.
> The regular expression work is where the bottle-neck appears to be,
> based on timings.
> I'm wondering if there is a faster way to do this, or some other method
> that you all would recommend.
>
> Brandon McGinty

Is it traditional mbox, or the more recent mbox variant that uses a Content-Length header?

Either way, you could probably read the mbox files line by line and yield a string corresponding to one message, one message at a time.

Traditional mbox is easier: you just look for lines that start with "From " (the regex ^From ). If a message actually wanted to include such a line in its body, whatever wrote the mbox should have prefixed it with a > or something similar to avoid ambiguity.

With the Content-Length header, you need to understand a little more about the header lines. This header gives the length of the message body, so the ugly > escape for "From " lines isn't needed.
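
Here is a minimal sketch of the line-by-line idea for traditional mbox, in case it helps. It assumes Python 3 and a file opened in binary mode; the name iter_mbox_messages is made up for illustration, and it relies on "From " lines in message bodies having already been escaped with a >. It does not handle the Content-Length variant.

```python
import io


def iter_mbox_messages(fileobj):
    """Yield each message in a traditional mbox file as one bytes object.

    fileobj is assumed to be a binary file object; only one message is
    held in memory at a time.
    """
    lines = []
    for line in fileobj:
        # A line starting with "From " marks the start of the next message.
        # Escaped body lines (">From ...") do not match this test.
        if line.startswith(b"From ") and lines:
            yield b"".join(lines)
            lines = []
        lines.append(line)
    # Emit the final message, which has no separator after it.
    if lines:
        yield b"".join(lines)


# Small in-memory example; a real mailbox would be open(path, "rb").
sample = (
    b"From a@example.com Tue Apr 26 12:39:00 2011\n"
    b"Subject: one\n\nbody one\n>From the start\n\n"
    b"From b@example.com Tue Apr 26 12:40:00 2011\n"
    b"Subject: two\n\nbody two\n"
)
messages = list(iter_mbox_messages(io.BytesIO(sample)))
```

Each yielded chunk still begins with its "From " separator line. A plain startswith test on each line avoids running a regex over large buffers, which may help with the bottleneck you measured, though I haven't timed it.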