[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

Daniel Lenski Thu, 30 Dec 2021 15:08:34 -0800


Daniel Lenski <dlen...@gmail.com> added the comment:


Due to this bug, any user of this function in Python 3.0+ *already* has to be 
able to handle all of the following outputs in order to use it reliably:

   decode_header(...) -> [(str, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]

== Fix str/bytes inconsistency ==

We could eliminate the inconsistency, and make the function only ever return 
bytes instead of str, with the following changes to 
https://github.com/python/cpython/blob/3.10/Lib/email/header.py.

```
diff --git a/Lib/email/header.py.orig b/Lib/email/header.py
index 4ab0032..41e91f2 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -61,7 +61,7 @@ _max_append = email.quoprimime._max_append
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
+    Returns a list of (bytes, charset) pairs containing each of the decoded
     parts of the header.  Charset is None for non-encoded parts of the header,
     otherwise a lower-case string containing the name of the character set
     specified in the encoded string.
@@ -78,7 +78,7 @@ def decode_header(header):
                     for string, charset in header._chunks]
     # If no encoding, just return the header with no charset.
     if not ecre.search(header):
-        return [(header, None)]
+        return [header.encode(), None)]
     # First step is to parse all the encoded parts into triplets of the form
     # (encoded_string, encoding, charset).  For unencoded strings, the last
     # two parts will be None.
```

With these changes, decode_header() would return one of the following:

   decode_header(...) -> [(bytes, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]


== Ensure that charset is always str, never None ==

A couple more small changes:

```
@@ -92,7 +92,7 @@ def decode_header(header):
                 unencoded = unencoded.lstrip()
                 first = False
             if unencoded:
-                words.append((unencoded, None, None))
+                words.append((unencoded, None, 'ascii'))
             if parts:
                 charset = parts.pop(0).lower()
                 encoding = parts.pop(0).lower()
@@ -133,7 +133,8 @@ def decode_header(header):
     # Now convert all words to bytes and collapse consecutive runs of
     # similarly encoded words.
     collapsed = []
-    last_word = last_charset = None
+    last_word = None
+    last_charset = 'ascii'
     for word, charset in decoded_words:
         if isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
```

With these changes, decode_header() would return only:

   decode_header(...) -> List[(bytes, str)]

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue22833>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

Reply via email to