On Sun, Aug 31, 2014 at 11:45 AM, Tim Chase <python.l...@tim.thechases.com> wrote:
> Tinkering around with a little script, I found myself with the need
> to walk a directory tree and process mail messages found within.
> Sometimes these end up being mbox files (with multiple messages
> within), sometimes it's a Maildir structure with messages in each
> individual file and extra holding directories, and sometimes it's a
> MH directory. To complicate matters, there's also the possibility of
> non-{mbox,maildir,mh} files such as binary MUA caches appearing
> alongside these messages.
>
> Python knows how to handle each just fine as long as I tell it what
> type of file to expect. But is there a straight-forward way to
> distinguish them? (FWIW, the *nix "file" utility is just reporting
> "ASCII text", sometimes "with very long lines", and sometimes
> erroneously flags them as C or C++ files‽).
>
> All I need is "is it maildir, mbox, mh, or something else" (I don't
> have to get more complex for the "something else") inside an os.walk
> loop.

If you find a directory full of numbered files (and optionally, numbered filenames preceded by commas), that's probably an MH folder. I don't like regexes that much, but I'd probably use one for this.

If you find a directory full of Maildir-style files, that's probably Maildir. You could probably match this with a regex too.

If you find a file with lots of lines matching '^From ', that's probably an mbox file. However, you could have an mbox file with only one '^From ', so watch out.

This will probably give some false positives and/or false negatives, depending on your data, but perhaps it's easier than classifying things manually.
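
Here is a rough sketch of those heuristics in Python, not a definitive implementation. The names classify, MH_NAME and MAILDIR_SUBDIRS are made up for illustration. Two assumptions differ slightly from the description above: for Maildir it checks for the cur/new/tmp subdirectories instead of matching the message filenames with a regex, and for mbox it only checks that the file starts with "From " instead of counting '^From ' lines, so it handles a single-message mbox but can also misfire on a lone message that happens to begin that way.

```python
import os
import re

# Assumption: MH message files are named with digits only, optionally
# preceded by a comma (messages marked as deleted).
MH_NAME = re.compile(r"^,?[0-9]+$")

# A Maildir keeps its messages in these three subdirectories.
MAILDIR_SUBDIRS = {"cur", "new", "tmp"}


def classify(path):
    """Guess whether path is 'maildir', 'mh', 'mbox' or 'other'."""
    if os.path.isdir(path):
        entries = os.listdir(path)
        subdirs = {e for e in entries
                   if os.path.isdir(os.path.join(path, e))}
        if MAILDIR_SUBDIRS <= subdirs:
            return "maildir"
        # Ignore dotfiles such as .mh_sequences when judging the names.
        files = [e for e in entries
                 if os.path.isfile(os.path.join(path, e))
                 and not e.startswith(".")]
        if files and all(MH_NAME.match(f) for f in files):
            return "mh"
        return "other"
    if os.path.isfile(path):
        # Read as bytes so binary MUA caches don't raise decode errors.
        with open(path, "rb") as fh:
            if fh.read(5) == b"From ":
                return "mbox"
    return "other"
```

Inside an os.walk loop you would probably call this on each dirpath first, and if it comes back as "maildir" or "mh", hand the directory to the mailbox module and clear dirnames so the walk doesn't descend into it. Otherwise call it on each file in that directory. Expect the same false positives and negatives as mentioned above.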