[issue12281] bytes.decode('mbcs', 'ignore') does replace undecodable bytes on Windows Vista or later

STINNER Victor Wed, 08 Jun 2011 05:48:47 -0700

STINNER Victor <victor.stin...@haypocalc.com> added the comment:

mbcs.patch fixes PyUnicode_DecodeMBCS():
 - only use flags=0 if errors="replace" on Windows >= Vista or if 
errors="ignore" on Windows < Vista
 - support any error handler
 - support any code page (but the code page is hardcoded to CP_ACP)


My patch always tries to decode in strict mode first. On a decode error, it 
decodes byte by byte and calls unicode_decode_call_errorhandler() on error.
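
For illustration only, here is a rough Python sketch of that strategy. It is not the C code from mbcs.patch: decode_with_fallback() is a hypothetical helper, the "ascii" codec is an assumed stand-in for the ANSI code page (so the sketch runs on any platform), and codecs.lookup_error() plays the role of unicode_decode_call_errorhandler().

```python
import codecs


def decode_with_fallback(data, encoding="ascii", errors="strict"):
    """Decode strictly first; on failure, retry one byte at a time.

    Hypothetical helper: `encoding` is a stand-in single-byte codec,
    not the Windows ANSI code page used by the real patch.
    """
    try:
        # Fast path: the whole input decodes without error.
        return data.decode(encoding, "strict")
    except UnicodeDecodeError:
        pass

    handler = codecs.lookup_error(errors)
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:pos + 1].decode(encoding, "strict"))
            pos += 1
        except UnicodeDecodeError:
            # Report the failure at its position in the full input and
            # let the registered error handler decide what to emit.
            exc = UnicodeDecodeError(encoding, data, pos, pos + 1,
                                     "invalid byte")
            replacement, newpos = handler(exc)
            out.append(replacement)
            pos = newpos
    return "".join(out)
```

With this sketch, decode_with_fallback(b"a\xffb", errors="ignore") drops the bad byte, while errors="replace" emits U+FFFD in its place. Because the loop feeds the decoder a single byte at a time, it is only correct for single-byte encodings.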

TODO:

 - don't use insize=1 (decode byte per byte): it doesn't work with multibyte 
encodings (like UTF-8)
 - use final in decode_mbcs_errors(): a multibyte character may be split 
between two chunks of INT_MAX bytes
 - fix all FIXME
 - patch also PyUnicode_EncodeMBCS()
 - implement the optimizations Martin suggested?
 - MB_ERR_INVALID_CHARS is not supported by some code pages (e.g. UTF-7 code 
page)
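
The first two TODO items can be illustrated in Python with UTF-8 as an example multibyte encoding. This is only a sketch of the general problem, not of the patch: decodes_alone() is a hypothetical helper, and the standard incremental decoder stands in for what a "final" flag in decode_mbcs_errors() might provide.

```python
import codecs

data = "é".encode("utf-8")  # two bytes: b"\xc3\xa9"


def decodes_alone(byte):
    """Return True if a single byte is valid UTF-8 on its own."""
    try:
        byte.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Decoding byte per byte fails: neither byte of the sequence is valid
# when taken in isolation.
per_byte_ok = [decodes_alone(data[i:i + 1]) for i in range(len(data))]

# An incremental decoder buffers an incomplete sequence while final is
# False, so a character split between two chunks still decodes.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(data[:1], final=False)   # incomplete: nothing emitted
second = decoder.decode(data[1:], final=True)   # sequence is now complete
```

Here per_byte_ok is [False, False], first is the empty string, and second is "é".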

Is it necessary to write a NUL character at the end? ("*out = 0;")

It would be nice to support any code page, and maybe support more options (e.g. 
MB_COMPOSITE, MB_PRECOMPOSED, MB_USEGLYPHCHARS to decode).

It is possible to test different code pages by changing the hardcoded code_page 
value in PyUnicode_DecodeMBCS. Change your region in the control panel if you 
would like to change the Windows ANSI code page. You can also play with 
SetThreadLocale() and CP_THREAD_ACP to test the ANSI code page of the current 
thread.

----------
keywords: +patch
Added file: http://bugs.python.org/file22282/mbcs.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12281>
_______________________________________
