[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

STINNER Victor Thu, 24 Feb 2011 15:56:29 -0800

STINNER Victor <victor.stin...@haypocalc.com> added the comment:

>> That won't work, Victor, since it makes invalid encoding
>> names valid, e.g. 'utf(=)-8'.


> .. but this *is* valid: ...

Ah yes, it's because of encodings.normalize_encoding(). It's funny: we have 3 
functions to normalize an encoding name, and each one does something different 
:-) E.g. encodings.normalize_encoding() doesn't replace non-ASCII letters, and 
doesn't convert to lowercase.
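
To see what encodings.normalize_encoding() does with the spellings mentioned 
in this thread, a minimal sketch follows. The list of names is only an 
illustration, and the exact results may vary between Python versions, so no 
expected output is given; on the CPython versions I know of, runs of 
punctuation should collapse into a single underscore and the case should be 
left alone.

```python
from encodings import normalize_encoding

# Spellings taken from this thread, plus an upper-case one to check
# whether the case is preserved.
names = ["utf(=)-8", "~~ utf#8 ~~", "UTF-8", "latin-1"]

# Map each raw spelling to its normalized form.
normalized = {name: normalize_encoding(name) for name in names}

for name, result in normalized.items():
    print(f"{name!r:>16} -> {result!r}")
```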

more_aggressive_normalization.patch changes all 3 normalization functions 
and adds tests for encodings.normalize_encoding().

I think that speed and backward compatibility are more important than 
conforming to IANA or other standards.

Even if "~~ utf#8 ~~" is ugly, I don't think that it really matters that we 
accept it.

--

If you don't want to touch the normalization functions and just want to add 
more aliases to the C fast-paths, we should also add utf8, utf16 and utf32.

Uses of "utf8" in Python: random.Random.seed(), 
smtpd.SMTPChannel.collect_incoming_data(), tarfile, multiprocessing.connection 
(xml serialization)

PS: On error, the UTF-8 decoder raises a UnicodeDecodeError with "utf8" as the 
encoding name :-)
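
The name the decoder reports can be read from the exception's encoding 
attribute. A small sketch (reported_encoding() is a hypothetical helper; the 
spelling it returns may differ between Python versions, so it is printed 
rather than assumed):

```python
def reported_encoding(data: bytes, encoding: str) -> str:
    """Return the encoding name stored in the UnicodeDecodeError."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.encoding
    raise ValueError("data decoded without error")


# 0xff can never appear in valid UTF-8, so decoding it always fails.
name = reported_encoding(b"\xff", "utf-8")
print(name)
```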

----------
Added file: http://bugs.python.org/file20880/more_aggressive_normalization.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11303>
_______________________________________
