[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

John Machin Tue, 30 Mar 2010 19:29:32 -0700

New submission from John Machin <[email protected]>:

Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on 
Conversion Processes") after requirement D93. Recent Pythons, e.g. 3.1.2, don't 
comply. Using the Unicode example:


 >>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
 '\ufffdB'
 # should produce u'\ufffdAB'

Resynchronisation currently starts at a position derived by considering the 
length implied by the start byte:

 >>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
 '\ufffdD'
 # should produce u'\ufffdABCD'; resync should start from the *failing* byte.
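
To make the requested behaviour concrete, here is a minimal sketch of a decoder 
that resynchronises at the failing byte. The function name decode_utf8_replace 
is hypothetical and this is not CPython's implementation; the byte ranges are 
my reading of the well-formed UTF-8 byte sequence table in chapter 3 of the 
Unicode Standard.

```python
def decode_utf8_replace(data: bytes) -> str:
    """Decode UTF-8, emitting U+FFFD for each ill-formed subsequence.

    Hypothetical illustration only: after an error, decoding resumes at
    the byte that failed, not at start + implied length.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # ASCII
            out.append(chr(b0))
            i += 1
            continue
        # need = number of continuation bytes; lo..hi = range allowed
        # for the FIRST continuation byte (later ones are 80..BF).
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF
        elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F
        else:                              # byte can never start a sequence
            out.append('\ufffd')
            i += 1
            continue
        cp = b0 & (0x7F >> (need + 1))     # payload bits of the start byte
        j = i + 1
        ok = True
        for _ in range(need):
            if j >= n or not (lo <= data[j] <= hi):
                ok = False                 # data[j] (if any) is the failing byte
                break
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
            lo, hi = 0x80, 0xBF
        out.append(chr(cp) if ok else '\ufffd')
        i = j                              # on failure, j is the failing byte
    return ''.join(out)


print(ascii(decode_utf8_replace(b"\xc2\x41\x42")))   # '\ufffdAB'
print(ascii(decode_utf8_replace(b"\xf1ABCD")))       # '\ufffdABCD'
```

The key line is `i = j`: the failing byte is re-examined as a possible start of 
a new sequence, so well-formed characters after the error are never consumed.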

Notes: This applies to the 'ignore' option as well as the 'replace' option. The 
Unicode discussion mentions "security exploits".
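
As a hedged illustration of why swallowing bytes can matter with 'ignore', the 
sketch below models the skip-by-implied-length behaviour described above and 
applies it to markup-like input. The names implied_length and 
legacy_style_ignore and the sample payload are hypothetical; this is a model of 
the reported behaviour, not the actual decoder code.

```python
def implied_length(start: int) -> int:
    """Sequence length suggested by a start byte alone (illustrative)."""
    if 0xC0 <= start <= 0xDF:
        return 2
    if 0xE0 <= start <= 0xEF:
        return 3
    if 0xF0 <= start <= 0xF7:
        return 4
    return 1


def legacy_style_ignore(data: bytes) -> str:
    """Model of 'ignore' that drops the whole implied-length chunk on error."""
    out = []
    i = 0
    while i < len(data):
        n = implied_length(data[i])
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode('utf-8'))
        except UnicodeDecodeError:
            pass                      # drop the chunk, valid ASCII included
        i += n
    return ''.join(out)


payload = b'<a title="\xf1">x</a>'
print(legacy_style_ignore(payload))   # <a title="</a>
# A decoder that drops only the ill-formed byte keeps the delimiters:
print(payload.decode('utf-8', 'ignore'))
```

In the modelled output the closing quote and `>` are consumed along with the 
bad byte, which changes how downstream code would parse the string.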

----------
messages: 101972
nosy: sjmachin
severity: normal
status: open
title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0
type: behavior
versions: Python 2.7, Python 3.1

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue8271>
_______________________________________