Michael wrote:

> I'm trying to import text from email I've received, run some regular
> expressions on it, and save the text into a database. I'm trying to
> figure out how to handle the issue of character sets. I've had some
> problems with my regular expressions on email that has interesting
> character sets. Korean text seems to be filled with a lot of '=3D=21'
> type of stuff.

looks like quoted-printable; see

    http://python.org/doc/lib/module-quopri.html

plus perhaps some character encoding on top of that. instead of rolling
your own message handling code, consider using the email package:

    http://python.org/doc/lib/module-email.html

in either case, the MIME specification is required reading here (for a
link, see the quopri page above).

> Do I need to do anything special when passing text with non-ascii
> characters to re

depends on your patterns. by default, RE operators like \w and \s
assume ASCII. to match non-ASCII text, use the (?u) flag and convert
your text to Unicode before passing it to the RE module.

> Is it better to save the text as-is in my db and save the character set
> type too or should I try to convert all text to some default format like
> UTF-8?

depends on your application; using a standard encoding has many
advantages, but storing the original text "as is" guarantees that no
information is lost, even if you have bugs in your conversion code.

when in doubt, save the original and do the conversion on the way out.

</F>
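
ps. a minimal sketch of the steps above, assuming Python 3 (where str patterns are Unicode-aware by default, so no (?u) flag is needed). the sample message, the euc-kr charset, and the helper names decode_body and find_words are made up for illustration; they are not from the original question.

```python
import quopri
import re
from email import message_from_bytes

# Hypothetical raw message: Korean text in euc-kr, sent as quoted-printable.
RAW = (
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="euc-kr"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"=C7=D1=B1=DB a=3Db=21\n"
)


def decode_body(raw):
    """Return (original_bytes, charset, text) for a single-part message."""
    msg = message_from_bytes(raw)
    # decode=True undoes the transfer encoding (quoted-printable here)
    # and returns bytes still in the message's own character set.
    original = msg.get_payload(decode=True)
    # Fall back to ASCII if the message does not declare a charset.
    charset = msg.get_content_charset() or "ascii"
    text = original.decode(charset, errors="replace")
    return original, charset, text


def find_words(text):
    """Run a regular expression on decoded (Unicode) text."""
    return re.findall(r"\w+", text)


# The low-level route: quopri only undoes the '=3D=21' style escapes.
plain = quopri.decodestring(b"a=3Db=21")

original, charset, text = decode_body(RAW)
words = find_words(text)

# "Save the original, convert on the way out": keep (original, charset)
# in the database and produce UTF-8 only when the text is needed.
utf8_out = original.decode(charset, errors="replace").encode("utf-8")
```

storing the (original, charset) pair means a bug in the conversion step can be fixed later without having lost anything.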