On Sun, Jan 02, 2022 at 01:23:38PM +0100, Jaroslaw Rafa <r...@rafa.eu.org> wrote:
> On 1.01.2022 at 11:40:37, Frank Hwa wrote:
> >
> > For a multipart message, is text/plain part always in the first
> > location?
> > I just want to extract the plain text body of a message. I use the
> > code below (python), but was not very sure.
>
> I have a perl script that extracts all text/plain parts from multipart
> messages, up to 5 levels nesting of multipart messages one inside another
> (that level is configurable via a parameter in the script).
>
> If you want to look at it, it's here: http://rafa.eu.org/media/textconv.pl
> --
> Regards,
>    Jaroslaw Rafa
>    r...@rafa.eu.org
> --
> "In a million years, when kids go to school, they're gonna know: once there
> was a Hushpuppy, and she lived with her daddy in the Bathtub."

Another thing that might help is my "textmail" program, which is a mail
filter that converts non-text attachments into text attachments where
possible (using external translation programs), and deletes attachments
that can't be translated to text (like images).

It replaces multipart/alternative parts with the text/plain part unless
that part looks vestigial, in which case it replaces them with the other
alternative part converted to text. This is often much better than just
grabbing the text/plain attachment, since the text/plain part might just
say something like "Your email client does not support HTML email".
There are a few builtin tests to identify vestigial text/plain parts,
and you can add new ones if necessary.

It can also save attachments with particular MIME types. A command like
this does something like what you want:

    cat msg | textmail | textmail -F text/plain -G /path/for/attachments >/dev/null

That performs the default transformations, then saves all resulting
text/plain attachments to a directory, and discards the resulting mail
message.

    https://raf.org/textmail
    https://github.com/raforg/textmail

However, it requires multiple external processes (textmail/perl itself
and the translators), and so probably only works on UNIX-like systems.
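To give a rough idea of the multipart/alternative handling described
above, here is a minimal Python sketch. It is not textmail's code
(textmail is Perl) and does not reproduce its builtin tests: the hint
strings, the 200-character limit, and the names looks_vestigial and
choose_alternative are all made up for illustration. It also assumes the
multipart/alternative has at least one part and that the text/plain body
is not base64 or quoted-printable encoded.

```python
import email
from email.message import Message

# Hypothetical markers of a placeholder text/plain part. These are
# illustrative assumptions, not the tests textmail itself applies.
VESTIGIAL_HINTS = (
    "does not support html",
    "html-capable",
    "view this email in your browser",
)


def looks_vestigial(text: str, max_len: int = 200) -> bool:
    """Guess whether a text/plain body is only a stub for the HTML part."""
    stripped = text.strip()
    if not stripped:
        return True
    lowered = stripped.lower()
    # Only short bodies count as stubs, so a long real message that
    # happens to mention HTML is left alone.
    return len(stripped) <= max_len and any(h in lowered for h in VESTIGIAL_HINTS)


def choose_alternative(msg: Message) -> Message:
    """Pick which part of a multipart/alternative message to keep."""
    parts = msg.get_payload()
    plain = [p for p in parts if p.get_content_type() == "text/plain"]
    other = [p for p in parts if p.get_content_type() != "text/plain"]
    # Prefer text/plain unless there is another alternative and the
    # text/plain part looks like a placeholder.
    if plain and not (other and looks_vestigial(plain[0].get_payload())):
        return plain[0]
    return other[0]
```

Calling choose_alternative() on a parsed multipart/alternative message
returns the sub-part to keep. Converting a chosen HTML part to text is a
separate step that this sketch leaves out.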
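If you need it to be pure Python, and aren't expecting any vestigial
text/plain parts, you could modify your existing script to recursively
examine all parts looking for text/plain. Something like this:

    import email

    def get_text_parts(msg):
        """Recursively collect the decoded text/plain parts of msg."""
        parts = []
        if msg.is_multipart():
            # Descend into each sub-part; multiparts can nest.
            for part in msg.get_payload():
                parts.extend(get_text_parts(part))
        elif msg.get_content_type() == 'text/plain':
            # decode=True undoes base64/quoted-printable and returns bytes.
            charset = msg.get_content_charset() or 'utf-8'
            parts.append(msg.get_payload(decode=True).decode(charset, 'replace'))
        return parts

    # x is the raw message text, as in your script
    text_parts = get_text_parts(email.message_from_string(x))
    print('%r' % text_parts)

cheers,
raf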