Re: Why are some unicode error handlers "encode only"?

Terry Reedy Sun, 11 Mar 2012 10:14:12 -0700

On 3/11/2012 10:37 AM, Steven D'Aprano wrote:

At least two standard error handlers are documented as working for
encoding only:


xmlcharrefreplace
backslashreplace

See http://docs.python.org/library/codecs.html#codec-base-classes

and http://docs.python.org/py3k/library/codecs.html

Why is this?

I presume the purpose of both is to facilitate transmission of unicode text via byte transmission by extending incomplete byte encodings: unicode chars that do not fit in the given encoding are replaced by an ascii byte sequence that will fit.
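
To make that concrete, here is a minimal sketch of both handlers on the encode side. The sample string and the choice of ascii as the "incomplete" encoding are my own illustration, not something taken from the question.

```python
# Illustrative sample text (an assumption for this sketch): it contains two
# characters that the ascii codec cannot represent.
text = "caf\u00e9 \u20ac5"

# xmlcharrefreplace substitutes a decimal XML character reference.
as_xml = text.encode("ascii", "xmlcharrefreplace")

# backslashreplace substitutes a Python-style backslash escape.
as_backslash = text.encode("ascii", "backslashreplace")

print(as_xml)        # b'caf&#233; &#8364;5'
print(as_backslash)  # b'caf\\xe9 \\u20ac5'
```

Either way the result is pure ascii, so it survives any byte channel that can carry ascii.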

I don't see why they shouldn't work for decoding as well.
Consider this example using Python 3.2:

b"aaa--\xe9z--\xe9!--bbb".decode("cp932")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
illegal multibyte sequence

The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
or can't be supported?

# This doesn't actually work.
b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
=>  r'aaa--騷--\xe9\x21--bbb'

This output does not round-trip and would be a bit of a fib since it somewhat misrepresents what the encoded bytes were:


>>> r'aaa--騷--\xe9\x21--bbb'.encode("cp932")
b'aaa--\xe9z--\\xe9\\x21--bbb'
>>> b'aaa--\xe9z--\\xe9\\x21--bbb'.decode("cp932")
'aaa--騷--\\xe9\\x21--bbb'

Python 3 added surrogateescape error handling to solve this problem.
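
A minimal sketch of that round trip follows. To keep it short I use utf-8 rather than cp932, and the input bytes are made up for illustration; both choices are my assumptions, not part of the example above.

```python
# Illustrative input (an assumption): the byte 0xE9 followed by '!' is not
# valid UTF-8, so a strict decode would raise UnicodeDecodeError.
raw = b"aaa--\xe9!--bbb"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate
# U+DCNN instead of raising.
decoded = raw.decode("utf-8", "surrogateescape")
print(ascii(decoded))  # 'aaa--\udce9!--bbb'

# Encoding with the same handler turns the surrogate back into the
# original byte, so nothing is lost or misrepresented.
restored = decoded.encode("utf-8", "surrogateescape")
print(restored == raw)  # True
```

Unlike the hypothetical backslashreplace output, the escaped byte here cannot be confused with text that was really present in the input.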

and similarly for xmlcharrefreplace.

Since xml character references are representations of unicode chars, and not bytes, I do not see how that would work. By analogy, perhaps you mean to have '&#e9;' in your output instead of '\xe9\x21', but those would not properly be xml numeric character references.
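
To illustrate the chars-versus-bytes point, here is a small sketch. The sample character is my own choice, and html.unescape merely stands in for whatever would parse the reference on the receiving side.

```python
import html

ch = "\u00e9"  # e with acute accent, code point 233 (0xE9)

# The reference carries the code point number, whatever the target codec.
ref = ch.encode("ascii", "xmlcharrefreplace")
print(ref)  # b'&#233;'

# The bytes of the same char depend on the encoding; in utf-8 there are two.
print(ch.encode("utf-8"))  # b'\xc3\xa9'

# A parser resolves the reference back to the char, not to a byte.
print(html.unescape(ref.decode("ascii")) == ch)  # True
```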


--
Terry Jan Reedy


