New submission from STINNER Victor <victor.stin...@haypocalc.com>:

It would be nice to support PEP 383 (surrogateescape) on Windows, but the mbcs 
codec doesn't support it for performance reasons. The Windows functions to 
encode/decode MBCS don't give the index of the unencodable/undecodable 
character/byte. For encoding, we can try to encode character by character (but 
be careful with surrogate pairs) and check whether each character is a Python 
lone surrogate (a character in the range U+DC80..U+DCFF). For decoding, 
it is more complex because MBCS can be a multibyte encoding, e.g. cp65001 
(Microsoft variant of utf-8, see #6058). So it's not possible to decode byte 
by byte, and we would have to write a heuristic to guess the right number of 
bytes for each call to the decode function.
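
As a rough illustration of the encoding half of this idea, here is a minimal 
Python 3 sketch. It is not the mbcs codec's implementation: the function name 
encode_char_by_char is hypothetical, and "cp1252" stands in for the Windows 
MBCS encoder so the example runs on any platform. It also assumes a Python 
where iterating over a str yields whole code points, so surrogate pairs need 
no special handling here.

```python
def encode_char_by_char(text, encoding="cp1252"):
    """Encode text one character at a time, emulating surrogateescape.

    Hypothetical helper: "cp1252" is an assumption standing in for the
    Windows MBCS encoder, which cannot report the failing index itself.
    """
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Lone surrogate produced by surrogateescape on decoding:
            # restore the original byte (0x80..0xFF).
            out.append(code - 0xDC00)
        else:
            # Any other character goes through the real encoder; a truly
            # unencodable character still raises UnicodeEncodeError.
            out += ch.encode(encoding, "strict")
    return bytes(out)
```

For example, encode_char_by_char("abc\udc80def") gives the same bytes as 
"abc\udc80def".encode("cp1252", "surrogateescape"). The per-character call is 
what makes this approach slow, which is the performance concern mentioned 
above.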

--

A completely different solution is to get the MBCS code page and use the Python 
code page codec (e.g. "cp1252") instead of the "mbcs" encoding, because Python 
cpXXXX codecs support all Python error handlers. Example (with Python 2.6):

>>> print(u"abcŁdef".encode("cp1252", "replace"))
abc?def
>>> print(u"abcŁdef".encode("cp1252", "ignore"))
abcdef
>>> print(u"abcŁdef".encode("cp1252", "backslashreplace"))
abc\u0141def
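
A possible Python 3 sketch of this alternative follows. The helper names 
ansi_codec_name and roundtrip are hypothetical. The code assumes that 
GetACP() from kernel32 (called through ctypes) returns the active ANSI code 
page number and that Python ships a matching cpXXXX codec; off Windows it 
simply falls back to an assumed default of "cp1252" so it can be run anywhere.

```python
import codecs
import sys


def ansi_codec_name(default="cp1252"):
    """Return the name of the Python codec for the ANSI code page.

    Hypothetical helper; `default` is an assumption used off Windows.
    """
    if sys.platform == "win32":
        import ctypes
        name = "cp%d" % ctypes.windll.kernel32.GetACP()
    else:
        name = default
    # Raises LookupError if Python has no codec for this code page.
    codecs.lookup(name)
    return name


def roundtrip(data, codec):
    """Decode then re-encode bytes with the surrogateescape handler."""
    text = data.decode(codec, "surrogateescape")
    return text, text.encode(codec, "surrogateescape")
```

With the "cp1252" codec, the byte 0x81 is undefined, so 
roundtrip(b"abc\x81def", "cp1252") decodes it to the lone surrogate U+DC81 
and encodes it back to the original bytes.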

See also #8611 for the problem that occurs when the Python path cannot be 
encoded to mbcs (work in progress, see #9425).

----------
components: Interpreter Core, Library (Lib), Unicode, Windows
messages: 116001
nosy: haypo, loewis
priority: normal
severity: normal
status: open
title: Support PEP 383 on Windows: mbcs support of surrogateescape error handler
versions: Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9821>
_______________________________________