Steffen Daode Nurpmeso <sdao...@googlemail.com> added the comment:

On Fri, Feb 25, 2011 at 03:43:06PM +0000, Marc-Andre Lemburg wrote:
>
> Marc-Andre Lemburg <m...@egenix.com> added the comment:
>
> r88586: Normalized the encoding names for Latin-1 and UTF-8 to
> 'latin-1' and 'utf-8' in the stdlib.

Even though - or maybe exactly because - I'm a newbie, I really want
to add another message after all this biting is over.
I've just read PEP 100 and msg129257 (on Issue 5902), and I feel
a bit confused.

> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> It turns out that there are three "normalize" functions that are
> successively applied to the encoding name during evaluation of
> str.encode/str.decode.
>
> 1. normalize_encoding() in unicodeobject.c
>
> This was added to have the few shortcuts we have in the C code
> for commonly used codecs match more encoding aliases.
>
> The shortcuts completely bypass the codec registry and also
> bypass the function call overhead incurred by codecs
> run via the codec registry.

What I understand least is that names which are illegal according
to the IANA standards are fine on the one hand (latin-1, utf-16-be)
but bad on the other, i.e. in my group-preserving code or in haypo's
very fast but name-joining patch (the first one): a *local* change in
unicodeobject.c, whose result is used *only* by the two callers
PyUnicode_Decode() and PyUnicode_AsEncodedString().
However:

> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> Programmers who don't use the encoding names triggering those
> optimizations will still have a running program, it'll only be
> a bit slower and that's perfectly fine.

> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> think rather than removing any hyphens, spaces, etc. the
> function should additionally:
>
> * add hyphens whenever (they are missing and) there's switch
> from [a-z] to [0-9]
>
> That way you end up with the correct names for the given set
> of optimized encoding names.

haypo's patch can easily be adjusted to reflect this, resulting in
much cleaner code in the two callers mentioned above, because
normalize_encoding() would then do the job it was meant for.
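
The hyphen rule quoted above could be sketched roughly as follows. This is an illustration in Python, not the C code in unicodeobject.c and not anyone's actual patch; the function name `add_group_hyphens` is hypothetical, and the lower-casing step is an assumption that the quoted rule does not spell out.

```python
import re

def add_group_hyphens(name: str) -> str:
    """Hypothetical helper: lower-case the name (an assumption) and
    add a hyphen wherever a letter is directly followed by a digit.
    Existing hyphens, spaces and underscores are left untouched."""
    name = name.lower()
    # Zero-width match between a letter and a digit: only positions
    # where the hyphen is missing are affected.
    return re.sub(r"(?<=[a-z])(?=[0-9])", "-", name)

for spelling in ("latin1", "UTF8", "utf-8", "utf-16-be", "ISO8859-1"):
    print(spelling, "->", add_group_hyphens(spelling))
```

With this sketch "latin1" becomes "latin-1" and "UTF8" becomes "utf-8", while names that already carry the hyphen pass through unchanged. Only the letter-to-digit switch is handled, so "utf16be" would become "utf-16be", not "utf-16-be".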

(Hmmm, and my own code could also be adjusted to match Python
semantics (using a hyphen instead of a space as the group separator),
so that an end user can choose among *all* IANA standard names
(e.g. "ISO-8859-1", "ISO8859-1", "ISO_8859-1", "LATIN1") and still
gain the full optimization benefit of latin-1, which seems to be
pretty useful for limburger.)

> Ezio Melotti wrote:
> Marc-Andre Lemburg wrote:
>> That won't work, Victor, since it makes invalid encoding
>> names valid, e.g. 'utf(=)-8'.
>
> That already works in Python (thanks to encodings.normalize_encoding)

*However*: with PEP 100, Python decided to go its own way
a decade ago.

> Marc-Andre Lemburg <m...@egenix.com> added the comment:
> 2. normalizestring() in codecs.c
>
> This is the normalization applied by the codec registry. See PEP 100
> for details:
>
> """
> Search functions are expected to take one argument,
> the encoding name in all lower case letters and with hyphens
> and spaces converted to underscores, ...
> """

> 3. normalize_encoding() in encodings/__init__.py
>
> This is part of the stdlib encodings package's codec search function.

First: *I* go for haypo:

> It's funny: we have 3 functions to normalize an encoding name, and
> each function does something else :-)

(that's Issue 11322:)

> We should first implement the same algorithm of the 3 normalization
> functions and add tests for them

And *I* don't understand anything else (*I* do have *my* own
s_textcodec_normalize_name(), now further optimized, thanks).
However, two different functions (a very fast one that is just enough
for unicodeobject.c, and a global one for everything else) may also do.
Isn't anything else a maintenance mess?
Where is that database, and are there any known dependencies that
are exposed to end users? Or the like.

I'm much too loud; have a nice weekend.
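
For reference, the PEP 100 wording quoted above and the four IANA spellings mentioned earlier can be tried out with a small sketch. The helper `pep100_normalize` is hypothetical and only mirrors the quoted sentence (lower case, hyphens and spaces converted to underscores); it is not the code in codecs.c. `codecs.lookup()` is the real stdlib call and shows which codec each spelling resolves to.

```python
import codecs

def pep100_normalize(name: str) -> str:
    """Hypothetical helper mirroring the PEP 100 sentence quoted above:
    all lower case, hyphens and spaces converted to underscores."""
    return name.lower().replace("-", "_").replace(" ", "_")

SPELLINGS = ["ISO-8859-1", "ISO8859-1", "ISO_8859-1", "LATIN1"]

# What a search function would be handed under the quoted rule.
normalized = {s: pep100_normalize(s) for s in SPELLINGS}

# Which codec the stdlib registry resolves each spelling to.
resolved = {s: codecs.lookup(s).name for s in SPELLINGS}

for s in SPELLINGS:
    print(f"{s!r} -> {normalized[s]!r} -> {resolved[s]!r}")
```

All four spellings resolve to the same codec through the registry, although the normalized strings handed to the search function differ. Whether a given spelling also hits the C-level shortcut is a separate question that this sketch does not answer.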

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11303>
_______________________________________