Marc-Andre Lemburg <m...@egenix.com> added the comment:

STINNER Victor wrote:
>
> New submission from STINNER Victor <victor.stin...@haypocalc.com>:
>
> PyUnicode_Decode() and PyUnicode_AsEncodedString() calls directly builtin
> decoders/encoders for some known encodings (eg. "utf-8"), instead of using
> the slow path (call PyCodec_Decode() / PyCodec_Encode()).
>
> PyUnicode_Decode() does normalize the encoding name: convert to lower and
> replace "_" by "-", as normalizestring() does. But
> PyUnicode_AsEncodedString() doesn't normalize the encoding name, it just use
> strcmp(). PyUnicode_Decode() has a shortcut for ISO-8859-1, whereas
> PyUnicode_AsEncodedString() doesn't (only for "latin-1").
>
> Attached patch creates a subfunction (static) normalize_encoding(), use it in
> PyUnicode_Decode() and PyUnicode_AsEncodedString(), and adds a shortcut for
> ISO-8859-1 to PyUnicode_AsEncodedString().
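For readers without the patch at hand, the normalization described in the quoted report (lower-casing and replacing "_" with "-") can be sketched roughly as follows. This is only an illustration, not the code from the attached patch: the names normalize_encoding_sketch() and is_latin1_name(), the 32-byte buffer, and the use of tolower() from <ctype.h> (the real code may use a locale-independent helper) are all assumptions. The sketch takes an explicit buffer size so that the output has a fixed upper bound.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT the function from the attached patch.
   Copies `encoding` into `lower`, converting letters to lower case and
   '_' to '-'.  Returns 1 on success, 0 if the name does not fit into
   `lower` (including the terminating NUL). */
static int
normalize_encoding_sketch(const char *encoding, char *lower, size_t lower_len)
{
    const char *e = encoding;
    char *l = lower;
    char *l_end;

    assert(encoding != NULL && lower != NULL);
    if (lower_len == 0)
        return 0;
    l_end = lower + lower_len - 1;      /* reserve one byte for '\0' */

    while (*e != '\0') {
        if (l == l_end)
            return 0;                   /* name too long for the buffer */
        if (*e == '_')
            *l++ = '-';
        else
            *l++ = (char)tolower((unsigned char)*e);
        e++;
    }
    *l = '\0';
    return 1;
}

/* Hypothetical shortcut test: accepts spelling variants of the two
   Latin-1 names mentioned in the report.  The buffer size of 32 is an
   assumption made for this sketch. */
static int
is_latin1_name(const char *encoding)
{
    char lower[32];

    if (!normalize_encoding_sketch(encoding, lower, sizeof(lower)))
        return 0;                       /* would fall back to the slow path */
    return strcmp(lower, "latin-1") == 0
        || strcmp(lower, "iso-8859-1") == 0;
}
```

With this sketch, is_latin1_name("ISO_8859_1") and is_latin1_name("Latin-1") both match, while a name that does not fit into the buffer is simply treated as "no shortcut".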

The normalization in PyUnicode_Decode() must have been added to Python3 only. It is not present in Python2.

I'm not sure whether it's a good idea to extend this further: the shortcuts were meant for Python internal use only. Python itself and its stdlib should only use the shortcut names for the respective special encodings and no variants. Dealing with variants and normalization is left to the encodings package and its alias machinery.

Since the Python stdlib and the core already mostly use the shortcut names, adding normalization won't buy us much.

Note that your change has also made it impossible for the compiler to do loop unrolling - there's no upper limit on the size of lower anymore.

In terms of coding style, "static" should go on a separate line.

----------
nosy: +lemburg

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8922>
_______________________________________