New submission from era <era+pyt...@iki.fi>:

https://github.com/python/cpython/blob/3.7/Lib/email/contentmanager.py#L64 
currently contains the following code:

    def get_text_content(msg, errors='replace'):
        content = msg.get_payload(decode=True)
        charset = msg.get_param('charset', 'ASCII')
        return content.decode(charset, errors=errors)

This breaks when the IANA character set is not identical to the Python encoding 
name. For example, pass it a message with

    Content-type: text/plain; charset=cp-850

This breaks for two separate reasons (and I will report two separate bugs): the 
IANA character-set label should be looked up and converted to a Python codec 
name (that's this bug), and the character-set alias 'cp-850' is not defined in 
the lookup table in the first place.

There are probably other places in contentmanager.py where a similar mapping 
should take place. 

I do not have a proper patch, but in general outline, the fix would look like

+    from email.charset import Charset
+
     def get_text_content(msg, errors='replace'):
         content = msg.get_payload(decode=True)
         charset = msg.get_param('charset', 'ASCII')
-        return content.decode(charset, errors=errors)
+        encoding = Charset(charset).output_charset
+        return content.decode(encoding, errors=errors)
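
For illustration only, here is a minimal, self-contained sketch of the kind of 
label-to-codec lookup described above. It is not the proposed patch: the 
EXTRA_ALIASES table and the resolve_codec() helper are hypothetical names 
invented for this example, and the silent fallback to ASCII is an assumption 
that would need discussion.

```python
import codecs
from email import message_from_bytes, policy

# Hypothetical alias table for charset labels that Python's codec registry
# may not recognise. This is NOT part of the standard library.
EXTRA_ALIASES = {"cp-850": "cp850"}


def resolve_codec(label, default="ascii"):
    """Map a charset label to a Python codec name, or fall back to default."""
    for candidate in (label, EXTRA_ALIASES.get(label.lower())):
        if candidate is None:
            continue
        try:
            # codecs.lookup() raises LookupError for unknown encodings.
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return default


def get_text_content(msg, errors="replace"):
    content = msg.get_payload(decode=True)
    charset = msg.get_param("charset", "ASCII")
    return content.decode(resolve_codec(charset), errors=errors)


# A message using the charset label from this report; the body bytes are
# "smörgås" encoded in code page 850.
raw = b"Content-Type: text/plain; charset=cp-850\n\nsm\x94rg\x86s\n"
msg = message_from_bytes(raw, policy=policy.default)
print(get_text_content(msg))
```

Trying the label as given first means any name the codec registry already 
knows keeps working unchanged; the extra table is consulted only on failure.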

This was discovered in this Stack Overflow post: 
https://stackoverflow.com/a/51961225/874188

----------
components: email
messages: 323869
nosy: barry, era, r.david.murray
priority: normal
severity: normal
status: open
title: email.contentmanager should use IANA encoding
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34459>
_______________________________________