Daniel Lenski <dlen...@gmail.com> added the comment:
Due to this bug, any user of this function in Python 3.0+ *already* has to be able to handle all of the following outputs in order to use it reliably:

  decode_header(...) -> [(str, None)]
or
  decode_header(...) -> [(bytes, str)]
or
  decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]

== Fix str/bytes inconsistency ==

We could eliminate the inconsistency, and make the function only ever return bytes instead of str, with the following changes to https://github.com/python/cpython/blob/3.10/Lib/email/header.py.

```
diff --git a/Lib/email/header.py.orig b/Lib/email/header.py
index 4ab0032..41e91f2 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -61,7 +61,7 @@ _max_append = email.quoprimime._max_append
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
+    Returns a list of (bytes, charset) pairs containing each of the decoded
     parts of the header.  Charset is None for non-encoded parts of the header,
     otherwise a lower-case string containing the name of the character set
     specified in the encoded string.
@@ -78,7 +78,7 @@ def decode_header(header):
                     for string, charset in header._chunks]
     # If no encoding, just return the header with no charset.
     if not ecre.search(header):
-        return [(header, None)]
+        return [(header.encode(), None)]
     # First step is to parse all the encoded parts into triplets of the form
     # (encoded_string, encoding, charset).  For unencoded strings, the last
     # two parts will be None.
```

With these changes, decode_header() would return one of the following:

  decode_header(...) -> [(bytes, None)]
or
  decode_header(...) -> [(bytes, str)]
or
  decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]
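For callers who cannot wait for a fix, the str/bytes normalization above can be approximated outside the library. The following is only a sketch: `decode_header_bytes` is a hypothetical wrapper name, not part of the `email` package, and it assumes that encoding a str part with `str.encode()` (UTF-8 by default, as in the proposed patch) is acceptable. The charset is left exactly as `decode_header()` returns it (str or None).

```python
from email.header import decode_header


def decode_header_bytes(header):
    """Hypothetical wrapper: always return (bytes, charset) pairs.

    A str part (returned today for headers with no encoded words) is
    encoded with str.encode(), mirroring the proposed patch. Parts that
    are already bytes are passed through unchanged.
    """
    return [
        (part.encode() if isinstance(part, str) else part, charset)
        for part, charset in decode_header(header)
    ]


# The three kinds of input discussed above:
plain = decode_header_bytes("Hello")                          # no encoded words
encoded = decode_header_bytes("=?utf-8?q?caf=C3=A9?=")        # one encoded word
mixed = decode_header_bytes("Hello =?utf-8?q?caf=C3=A9?=")    # both kinds

for name, result in [("plain", plain), ("encoded", encoded), ("mixed", mixed)]:
    print(name, result)
```

With this wrapper the first element of every pair is bytes, so only the charset still needs a None check.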
== Ensure that charset is always str, never None ==

A couple more small changes:

```
@@ -92,7 +92,7 @@ def decode_header(header):
                 unencoded = unencoded.lstrip()
                 first = False
             if unencoded:
-                words.append((unencoded, None, None))
+                words.append((unencoded, None, 'ascii'))
             if parts:
                 charset = parts.pop(0).lower()
                 encoding = parts.pop(0).lower()
@@ -133,7 +133,8 @@ def decode_header(header):
     # Now convert all words to bytes and collapse consecutive runs of
     # similarly encoded words.
     collapsed = []
-    last_word = last_charset = None
+    last_word = None
+    last_charset = 'ascii'
     for word, charset in decoded_words:
         if isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
```

With these changes, decode_header() would return only:

  decode_header(...) -> List[(bytes, str)]

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue22833>
_______________________________________