Bugs item #1324237, was opened at 2005-10-11 23:35
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1324237&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Unicode
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Eray Ozkural (exa)
Assigned to: M.-A. Lemburg (lemburg)
Summary: ISO8859-9 broken

Initial Comment:
Probably not limited to ISO8859-9.

The problem is that the encodings returned by getlocale() and getpreferredencoding() are not guaranteed to work with, say, the encode method of string.

I'm on MDK10.2 and I switch to the Turkish locale

>>> locale.setlocale(locale.LC_ALL, '')
'tr_TR'

There is nothing in sys.stdout.encoding!

>>> sys.stdout.encoding
>>>

So I take a look at the encoding:

>>> locale.getlocale()
['tr_TR', 'ISO8859-9']
>>> locale.getpreferredencoding()
'ISO-8859-9'

Too bad I cannot use either encoding to encode innocent unicode strings

>>> a = unicode('André','latin-1')
>>> print a.encode(locale.getpreferredencoding())
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
LookupError: unknown encoding: ISO-8859-9
>>> print a.encode(locale.getlocale()[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
LookupError: unknown encoding: ISO8859-9

So I take a look at the python page and I see that all encoding names are in lowercase. That's no good, because:

>>> locale.getpreferredencoding().lower()
'\xfdso-8859-9'

(see bug 1193061)

So I have to do this by hand! But of course this is unacceptable for any locale aware application.

>>> print a.encode('iso-8859-9')
André

Expected:
1. I expect the encoding string returned by getpreferredencoding and getlocale to be *identical*
2. I expect the encoding string returned to *work* with encode method and in general *any* function that accepts locales.

Got:
1. Different, ad hoc strings
2. Not all aliases present, only lowercases present, no reliable way to find a canonical locale name.

Recommendations:
a.
Please consider the Java-like solution to make Locale into a class or an enum, something reliable, rather than just a string.
b. Please test the locale functions in locales other than US (that is not really a locale anyway)

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-21 16:25

Message:
Logged In: YES
user_id=38388

SF has problems again it seems...

Anyway, I tried to set the TR_tr locale on my system and got a surprising result:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'tr_TR')
'tr_TR'
>>> locale.getpreferredencoding().lower()
'ans\xfd_x3.4-1968'
>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'

So I think the problem lies with the fact that string.lower() is locale dependent and the GLIBC folks chose a highly incompatible way of dealing with the special Turkish situation of the capital "I" mapping to lower-case.

While this kind of mapping may make sense for text processing in applications, it certainly does not make sense when dealing with programming code or things that need to be specified in plain ASCII.

In short: the encoding used for the TR_tr locale is not ASCII-compatible and thus not suitable for Python source code.

I'm not sure what to say to this. My only advice is to *not* set the global locale setting to TR_tr, but only do this when it comes to actually processing text in an application.

Alternatively, you could write your application text using Unicode and then use the ISO-8859-9 codec to encode it for I/O.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-21 16:18

Message:
Logged In: YES
user_id=38388

Something in your installation must be broken: it seems the system cannot find the ISO-8859-9 codec.

Note that the .encode() method uses the codec registry for the lookup of the codec. The lookup itself is done case-insensitive and subject to a few other normalizations (see encodings/__init__.py).

Please check your system and then report back whether you still see the reported error.

Thanks.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-21 16:12

Message:
Logged In: YES
user_id=38388

Something in your installation must be broken: it seems the system cannot find the ISO-8859-9 codec.

Note that the .encode() method uses the codec registry for the lookup of the codec. The lookup itself is done case-insensitive and subject to a few other normalizations (see encodings/__init__.py).

Please check your system and then report back whether you still see the reported error.

Thanks.

----------------------------------------------------------------------

Comment By: Eray Ozkural (exa)
Date: 2005-10-11 23:46

Message:
Logged In: YES
user_id=1454

BTW, I put this into Unicode category, because the bugs in it seemed relevant to localization. Thank you very much for your consideration.

----------------------------------------------------------------------

_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
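
For readers following the thread, the two points made in the comments (codec names are resolved through the codec registry regardless of spelling, and lowercasing an encoding name should not depend on the locale) can be sketched roughly as below. This is only an illustration, not code from the report: it is written for Python 3, whereas the thread concerns Python 2.4, and the helper names ascii_lower and canonical_codec_name are hypothetical. In Python 3, str.lower() no longer consults the locale, so ascii_lower is shown purely to make the ASCII-only intent explicit.

```python
import codecs
import string

# Hypothetical helper (not part of the standard library): lowercase only
# the ASCII letters A-Z, so the result never depends on the active locale.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(name):
    """Return name with ASCII uppercase letters mapped to lowercase."""
    return name.translate(_ASCII_LOWER)


def canonical_codec_name(name):
    """Resolve an encoding name through the codec registry.

    codecs.lookup() normalizes the spelling before searching, and raises
    LookupError if no codec is registered under that name.
    """
    return codecs.lookup(name).name


if __name__ == "__main__":
    # The spellings reported by getpreferredencoding() and getlocale()
    # in the thread, plus the hand-written lowercase form.
    for spelling in ("ISO-8859-9", "ISO8859-9", "iso-8859-9"):
        print(spelling, "->", canonical_codec_name(spelling))

    # Encode Unicode text for I/O with the resolved codec.
    print("André".encode(canonical_codec_name("ISO-8859-9")))
```

If the lookup raises LookupError for these names, that would point to a missing or broken encodings package in the installation, which is what the 16:18 comment suggests checking.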