Serhiy Storchaka added the comment:

There is a reason for the behavior in case 2. The input is likely truncated 
data, and it is safer to raise an exception than to silently produce a lone 
surrogate. The current UTF-7 encoder always adds '-' after a trailing shift 
sequence. I suppose this is not a bug.
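
The encoder behavior mentioned above can be checked with a minimal sketch like the one below. The sample string is an arbitrary choice, not taken from the report, and the output shown assumes a CPython 3 interpreter.

```python
# Arbitrary sample: U+20AC (euro sign) has to be encoded in a shift sequence.
text = '\u20ac'
encoded = text.encode('utf-7')

# The shift sequence at the end of the output is closed with an explicit '-'.
print(encoded)                 # b'+IKw-'
print(encoded.endswith(b'-'))  # True

# The terminated form round-trips.
assert encoded.decode('utf-7') == text
```

Since the encoder writes the terminator itself, input that ends inside a shift sequence without it is more likely to be cut off than intentional.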

However, there are three more bugs.

4. The decoder can emit a lone high surrogate before the replacement character 
in case of an error.

>>> b'+2DTdI-'.decode('utf-7', 'replace')
'\ud834�'

A high surrogate is part of an incomplete astral character and shouldn't be 
emitted when the encoded astral character contains an error.
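
To see why the input above is an error, the base64 payload can be split into 16-bit UTF-16 code units by hand. This is only an illustrative sketch: `split_utf16_units` is a hypothetical helper, not part of CPython or of the patch, and the complete payload "2DTdHg" (U+1D11E) is an added comparison case.

```python
import string

# Base64 alphabet (set B in RFC 2152), indexed by 6-bit value.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def split_utf16_units(b64_text):
    """Return (complete 16-bit units, leftover bit string) for a shift body."""
    bits = "".join(format(B64.index(ch), "06b") for ch in b64_text)
    whole = len(bits) - len(bits) % 16
    units = [int(bits[i:i + 16], 2) for i in range(0, whole, 16)]
    return units, bits[whole:]

# Payload from the example: one high surrogate, then 14 stray non-zero bits.
units, leftover = split_utf16_units("2DTdI")
print([hex(u) for u in units], leftover)  # ['0xd834'] 11011101001000

# Complete pair for comparison: only zero padding bits are left over.
units, leftover = split_utf16_units("2DTdHg")
print([hex(u) for u in units], leftover)  # ['0xd834', '0xdd1e'] 0000
```

The first payload stops in the middle of the second code unit, so the only complete unit is the high surrogate 0xD834, which cannot stand on its own.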

5. According to RFC 2152: "A "+" character followed immediately by any 
character other than members of set B or "-" is an ill-formed sequence." But 
the current decoder accepts this as an empty shift sequence that decodes to an 
empty string.

>>> b'a+,b'.decode('utf-7')
'a,b'
>>> b'a+'.decode('utf-7')
'a'
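
A stricter reading of the RFC rule could be sketched as a pre-check like the one below. `ill_formed_shifts` is a hypothetical helper, not what the decoder or the patch does. Treating a "+" at the very end of the input as ill-formed is an assumption that follows the second example above.

```python
import string

# Set B from RFC 2152: the base64 alphabet.
SET_B = (string.ascii_letters + string.digits + "+/").encode("ascii")

def ill_formed_shifts(data):
    """Return offsets of '+' bytes not followed by a set B byte or '-'."""
    offsets = []
    i = 0
    while i < len(data):
        if data[i:i + 1] != b"+":
            i += 1
            continue
        nxt = data[i + 1:i + 2]
        if nxt == b"-":
            i += 2  # "+-" encodes a literal "+"
        elif nxt and nxt in SET_B:
            i += 1
            # Skip the base64 body so a '+' inside it is not re-examined.
            while i < len(data) and data[i] in SET_B:
                i += 1
        else:
            offsets.append(i)  # followed by another byte, or by end of input
            i += 1
    return offsets

print(ill_formed_shifts(b'a+,b'))   # [1]
print(ill_formed_shifts(b'a+'))     # [1]
print(ill_formed_shifts(b'a+-b'))   # []
print(ill_formed_shifts(b'+IKw-'))  # []
```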

6. The replacement character '\ufffd' can be replaced with the character 'ý' ('\xfd'):

>>> b'\xff'.decode('utf-7', 'replace')
'�'
>>> b'a\xff'.decode('utf-7', 'replace')
'a�'
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
>>> b'\xffb'.decode('utf-7', 'replace')
'ýb'

This bug can be reproduced only in 3.4+.
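
Printing code points makes the difference between the expected and the observed output easier to see. `code_points` is a hypothetical helper. The last line only notes that 0xFD is the low byte of 0xFFFD, which might hint at truncation to 8 bits; the cause is not confirmed here.

```python
def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return ['U+%04X' % ord(ch) for ch in text]

print(code_points('\ufffdb'))  # expected result: ['U+FFFD', 'U+0062']
print(code_points('\xfdb'))    # observed result: ['U+00FD', 'U+0062']

# 0xFD is what remains of 0xFFFD when only the low 8 bits are kept.
print(hex(0xFFFD & 0xFF))      # 0xfd
```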

The attached patch fixes bugs 1 and 4 and adds more tests.

Corner cases 2 and 3 are likely not bugs.

I have doubts about fixing bug 5. iconv accepts such ill-formed sequences. In 
any case, I think a fix for this bug can be applied only to the default branch.

I have no idea how to fix bug 6. I am afraid it may be a bug in 
_PyUnicodeWriter and could therefore affect other decoders.
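
One rough way to probe whether other decoders are affected is to run the failing input through several codecs with the 'replace' handler. This is only a sketch: `lost_replacement` is a hypothetical helper, the codec list is an arbitrary selection, and an empty result does not rule out a problem on other inputs.

```python
def lost_replacement(codecs, data=b'\xffb'):
    """Return the codecs whose 'replace' output for data lacks U+FFFD."""
    return [name for name in codecs
            if '\ufffd' not in data.decode(name, 'replace')]

# b'\xffb' starts with a byte that is invalid in all three codecs, so
# each of them should produce U+FFFD followed by 'b'.
print(lost_replacement(['ascii', 'utf-8', 'utf-7']))
```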

----------
keywords: +patch
stage:  -> patch review
Added file: http://bugs.python.org/file40223/utf7_error_handling.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24848>
_______________________________________