New submission from Merlijn van Deen:

Bugzilla sends e-mail in a format where =?UTF-8 is not preceded by whitespace. 
This makes email.headerregistry.UnstructuredHeader (and 
email._header_value_parser on the background) not recognise the structure.

>>> import email.headerregistry, pprint
>>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 
>>> 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian 
>>> text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); 
>>> pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t'
            'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94',
 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), 
WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), 
ValueTerminal('New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;'), 
WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), 
ValueTerminal('text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94')])}

versus

>>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew: 
>>> =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text: 
>>> =?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew:  non-ascii bug tést;\trussian text:  АБВГҐД',
 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), 
WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), 
ValueTerminal('New:'), WhiteSpaceTerminal(' '), 
EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('non-ascii'), 
WhiteSpaceTerminal(' '), ValueTerminal('bug'), WhiteSpaceTerminal(' '), 
ValueTerminal('tést')]), ValueTerminal(';'), WhiteSpaceTerminal('\t'), 
ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:'), 
WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), 
ValueTerminal('АБВГҐД')])])}

I have attached the raw e-mail as attachment.


Judging by the code, this is supposed to work (while raising a Defect --  
"missing whitespace before encoded word"), but the code splits by whitespace:

tok, *remainder = _wsp_splitter(value, 1)

which swallows the encoded section in one go. In a second attachment, I added a 
patch which 1) adds a test case for this and 2) implements a solution, but the 
solution is unfortunately not in the style of the rest of the module.


In the meanwhile, I've chosen a monkey-patching approach to work around the 
issue:

import email._header_value_parser, email.headerregistry
def get_unstructured(value):
    value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")
    return email._header_value_parser.get_unstructured(value)
email.headerregistry.UnstructuredHeader.value_parser = 
staticmethod(get_unstructured)

----------
components: email
files: 000359.raw
messages: 216908
nosy: barry, r.david.murray, valhallasw
priority: normal
severity: normal
status: open
title: email._header_value_parser does not recognise in-line encoding changes
versions: Python 3.3, Python 3.4, Python 3.5
Added file: http://bugs.python.org/file34984/000359.raw

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21315>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to