[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

Steffen Daode Nurpmeso Sat, 26 Feb 2011 04:42:57 -0800

Steffen Daode Nurpmeso <sdao...@googlemail.com> added the comment:

On Fri, Feb 25, 2011 at 03:43:06PM +0000, Marc-Andre Lemburg wrote:
> 
> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> 
> r88586: Normalized the encoding names for Latin-1 and UTF-8 to
> 'latin-1' and 'utf-8' in the stdlib.

Even though - or maybe exactly because - I'm a newbie, I really
want to add another message after all this biting is over.
I've just read PEP 100 and msg129257 (on Issue 5902), and I feel
a bit confused.

> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> It turns out that there are three "normalize" functions that are 
> successively applied to the encoding name during evaluation of 
> str.encode/str.decode.
> 
> 1. normalize_encoding() in unicodeobject.c
>
> This was added to have the few shortcuts we have in the C code
> for commonly used codecs match more encoding aliases.
>
> The shortcuts completely bypass the codec registry and also
> bypass the function call overhead incurred by codecs
> run via the codec registry.

What I understand least is that names which are illegal
according to the IANA standards (latin-1, utf-16-be) are fine on
the one hand but bad on the other, i.e. in my group-preserving
code or in haypo's very fast but name-joining patch (the first
one): a *local* change in unicodeobject.c whose result is *only*
used by its two users, PyUnicode_Decode() and
PyUnicode_AsEncodedString().  However:

> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> Programmers who don't use the encoding names triggering those
> optimizations will still have a running program, it'll only be
> a bit slower and that's perfectly fine.
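
To see how large "a bit slower" is on a given interpreter, the
spellings can be timed side by side. This is only a rough sketch:
time_decode() is a hypothetical helper name, the list of spellings
is an arbitrary choice, and the numbers depend entirely on the
Python version in use.

```python
import timeit


def time_decode(name, number=100_000):
    """Return the seconds needed for `number` decodes of b'x' using `name`."""
    return timeit.timeit(lambda: b"x".decode(name), number=number)


if __name__ == "__main__":
    # Spellings that all resolve to the same codec; only the lookup path differs.
    for name in ("latin-1", "latin1", "ISO-8859-1", "iso8859_1"):
        print(f"{name:12} {time_decode(name):.4f}s")
```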

> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> think rather than removing any hyphens, spaces, etc. the
> function should additionally:
>
>  * add hyphens whenever (they are missing and) there's switch
>     from [a-z] to [0-9]
>
> That way you end up with the correct names for the given set 
> of optimized encoding names.

haypo's patch can easily be adjusted to reflect this, resulting in
much cleaner code in the two mentioned users, because
normalize_encoding() would then do the job it was meant for.
(Hmmm, and my own code could also be adjusted to match Python
semantics (using a hyphen instead of a space as the group
separator), so that an end-user has the choice between *all* IANA
standard names (e.g. "ISO-8859-1", "ISO8859-1", "ISO_8859-1",
"LATIN1") and would gain the full optimization benefit of using
latin-1, which seems to be pretty useful for limburger.)
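
The rule quoted above (add a hyphen wherever the name switches
from [a-z] to [0-9]) can be sketched in Python as follows. This is
not the C code from unicodeobject.c or from any of the patches;
normalize_for_shortcut() is a hypothetical name, and mapping
spaces and underscores to hyphens is an assumption made here for
illustration.

```python
import re


def normalize_for_shortcut(name):
    """Lower-case `name` and hyphenate it as in the proposal quoted above."""
    lowered = name.lower()
    # Assumption: treat spaces and underscores as group separators.
    lowered = re.sub(r"[ _]", "-", lowered)
    # Insert a hyphen at every switch from a letter to a digit.
    return re.sub(r"(?<=[a-z])(?=[0-9])", "-", lowered)


if __name__ == "__main__":
    for name in ("latin1", "LATIN1", "UTF8", "ISO8859-1", "ISO_8859-1"):
        print(name, "->", normalize_for_shortcut(name))
```

With such a rule the three ISO spellings collapse to one string,
so a shortcut table would need only one entry per codec.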

> Ezio Melotti wrote:
> Marc-Andre Lemburg wrote:
>> That won't work, Victor, since it makes invalid encoding
>> names valid, e.g. 'utf(=)-8'.
>
> That already works in Python (thanks to encodings.normalize_encoding)

*However*: with PEP 100, Python decided to go its own way
a decade ago.

> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> 2. normalizestring() in codecs.c
>
> This is the normalization applied by the codec registry. See PEP 100
> for details:
>
> """
>    Search functions are expected to take one argument, 
>    the encoding name in all lower case letters and with hyphens 
>    and spaces converted to underscores, ...
> """

> 3. normalize_encoding() in encodings/__init__.py
>
> This is part of the stdlib encodings package's codec search function.
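
For comparison, here is a small sketch of the second and third
normalizations. pep100_normalize() is a hypothetical helper that
only restates the PEP 100 sentence quoted above; it is not the C
normalizestring() from codecs.c. encodings.normalize_encoding()
and codecs.lookup() are the real stdlib functions, and their exact
behaviour may differ between Python versions.

```python
import codecs
import encodings


def pep100_normalize(name):
    """Lower-case `name` and turn hyphens and spaces into underscores."""
    return name.lower().replace("-", "_").replace(" ", "_")


if __name__ == "__main__":
    for name in ("Latin-1", "latin 1", "utf(=)-8"):
        # Show both normalizations next to each other.
        print(repr(name), pep100_normalize(name),
              encodings.normalize_encoding(name))
    # Both spellings end up at the same codec in the registry.
    print(codecs.lookup("latin1").name, codecs.lookup("latin-1").name)
```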

First: *I* go with haypo:

> It's funny: we have 3 functions to normalize an encoding name, and
> each function does something else :-)

(that's Issue 11322:)
> We should first implement the same algorithm of the 3 normalization
> functions and add tests for them

And *I* don't understand any other approach (*I* do have *my* own
s_textcodec_normalize_name(), now further optimized, thanks).
However, two different functions (a very fast one that is just
enough for unicodeobject.c and a global one for everything else)
may also do.
Isn't anything beyond that a maintenance mess?  Where is that
database, and are there any known dependencies that are exposed to
end-users?  Or the like.

I'm being much too loud; have a nice weekend.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11303>
_______________________________________
