New submission from Merlijn van Deen: Bugzilla sends e-mail in a format where =?UTF-8 is not preceded by whitespace. This makes email.headerregistry.UnstructuredHeader (and email._header_value_parser on the background) not recognise the structure.
>>> import email.headerregistry, pprint >>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug >>> 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian >>> text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); >>> pprint.pprint(x) {'decoded': '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t' 'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94')])} versus >>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew: >>> =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text: >>> =?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x) {'decoded': '[Bug 64155]\tNew: non-ascii bug tést;\trussian text: АБВГҐД', 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('non-ascii'), WhiteSpaceTerminal(' '), ValueTerminal('bug'), WhiteSpaceTerminal(' '), ValueTerminal('tést')]), ValueTerminal(';'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('АБВГҐД')])])} I have attached the raw e-mail as attachment. Judging by the code, this is supposed to work (while raising a Defect -- "missing whitespace before encoded word"), but the code splits by whitespace: tok, *remainder = _wsp_splitter(value, 1) which swallows the encoded section in one go. In a second attachment, I added a patch which 1) adds a test case for this and 2) implements a solution, but the solution is unfortunately not in the style of the rest of the module. In the meanwhile, I've chosen a monkey-patching approach to work around the issue: import email._header_value_parser, email.headerregistry def get_unstructured(value): value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?") return email._header_value_parser.get_unstructured(value) email.headerregistry.UnstructuredHeader.value_parser = staticmethod(get_unstructured) ---------- components: email files: 000359.raw messages: 216908 nosy: barry, r.david.murray, valhallasw priority: normal severity: normal status: open title: email._header_value_parser does not recognise in-line encoding changes versions: Python 3.3, Python 3.4, Python 3.5 Added file: http://bugs.python.org/file34984/000359.raw _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue21315> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com