Bugs item #1528802, was opened at 2006-07-26 09:05
Message generated for change (Comment added) made by sgala
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1528802&group_id=5470

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: Unicode
Group: Python 2.4
Status: Open
Resolution: None
Priority: 6
Submitted By: Ahmet Bişkinler (ahmetbiskinler)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Turkish Character

Initial Comment:
>>> print "Mayıs".upper()
>>> MAYıS
>>> import locale
>>> locale.setlocale(locale.LC_ALL,'Turkish_Turkey.1254')
>>> print "Mayıs".upper()
>>> MAYıS
>>> print "ğüşiöçı".upper()
>>> ğüşIöçı

MAYıS should be MAYIS
ğüşIöçı should be ĞÜŞİÖÇI

but
>>> "Mayıs".upper()
>>> "MAYIS"
is right

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-18 16:37

Message:
Logged In: YES
user_id=178886

Done: Bug #1542677

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 21:08

Message:
Logged In: YES
user_id=849994

Please submit that as a separate IDLE bug.

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 20:58

Message:
Logged In: YES
user_id=178886

Idle from 2.5rc1 (svn today) produces a different result than
console (with my default, utf-8, encoding):

IDLE 1.2c1
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
2
>>> print u"á".upper()
á
>>> str(u"á")
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    str(u"á")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Again, IDLE 1.1.3 (python 2.4.3) produces a different result:

IDLE 1.1.3
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
2
>>> print u"á".upper()
á
>>> str(u"á")
'\xc3\x83\xc2\xa1'
>>>

I'd say idle is broken, as it is not able to respect utf-8 for print
(or even len) of unicode strings.

OTOH, with some tricks I can manage to get an accented a in a unicode
in idle:

>>> import unicodedata
>>> print unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE")
á
>>> print len(unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE"))
1

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 17:08

Message:
Logged In: YES
user_id=849994

Using Unicode strings, the OP's example works.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-08-17 17:04

Message:
Logged In: YES
user_id=38388

String upper and lower conversion are locale dependent and
implemented by the underlying libc, whereas Unicode upper/lower
conversion is not and only depends on the Unicode character database.

OTOH, there are special cases where the standard Unicode upper/lower
mapping is not what you might expect, since the database only provides
a single mapping and is not context aware.

There's nothing we can do if the libc is broken in some respect.

As for the extended case mapping support in Unicode: patches are
welcome.

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 17:03

Message:
Logged In: YES
user_id=849994

sgala: it looks like your console sends UTF-8 encoded text.

>>> print "á"
á

print is just printing out a byte string consisting of two bytes,
which your console displays as accent-a.

>>> print len("á")
2

A UTF-8-encoded string containing an accented a has two bytes.

>>> print "á".upper()
á

str.upper() doesn't take locale into account, so the accented a has
no uppercase version defined.

>>> str("á")
'\xc3\xa1'

str() applied to a byte string returns that byte string. Since return
values from functions are printed by the interactive interpreter using
repr() first, you get this representation (which you could also get
from "print repr('a')".)

>>> print u"á"
á

That's also okay. Python knows the terminal encoding and properly
translates the byte string to a unicode string of one character.
On printout, it converts it to a UTF-8 string again, which your
terminal displays correctly.

>>> print len(u"á")
1

Since your two-byte-UTF-8 sequence is converted to a unicode
character, the length of this unicode string is 1.

>>> print u"á".upper()
Á

There are comprehensive capitalization tables available for unicode.

>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Applying str() to a unicode string must convert it to a byte string.
If you don't specify an encoding, the default encoding is "ascii",
which can't encode the accented a. Use "a".encode("utf-8").

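The byte-string versus Unicode distinction described in the two comments above can be sketched as follows. This is only an illustration, not code from the thread: it assumes Python 3, where `bytes` plays the role of the Python 2 `str` discussed here and `str` plays the role of `unicode`. The helper `turkish_upper` is a hypothetical name; it applies the Turkish dotted/dotless i rule by hand, because the default Unicode mapping is not language aware.

```python
# Sketch for Python 3 (an assumption): bytes ~ Python 2 str, str ~ Python 2 unicode.

text = "\u00e1"               # LATIN SMALL LETTER A WITH ACUTE, one character
raw = text.encode("utf-8")    # the two bytes a UTF-8 console would send

assert raw == b"\xc3\xa1"
assert len(raw) == 2          # len() of a byte string counts bytes
assert len(text) == 1         # len() of a Unicode string counts characters

# bytes.upper() only changes ASCII letters, so the encoded text is unchanged.
assert raw.upper() == raw
# str.upper() consults the Unicode character database.
assert text.upper() == "\u00c1"

# Converting to bytes needs an encoding that can represent the character.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = str(exc)


def turkish_upper(s):
    """Hypothetical helper: uppercase with the Turkish i rule applied by hand."""
    # Map dotted i to dotted capital I (U+0130) first; str.upper() already
    # maps dotless i (U+0131) to plain I.
    return s.replace("i", "\u0130").upper()


print(turkish_upper("ğüşiöçı"))
```

Plain `upper()` on the same Unicode text maps both `i` and `ı` to `I`, which is the single, context-free mapping mentioned above; handling the Turkish rule needs an explicit step like the one in the helper.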
----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 16:59

Message:
Logged In: YES
user_id=178886

(I tested it in 2.5rc1), 2.4 gives
>>> str(u"á")
'\xc3\xa1'

instead of the exception

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 16:53

Message:
Logged In: YES
user_id=178886

The behaviour of python in this area is confusing. See a session
with my Spanish keyboard:

>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
1
>>> print u"á".upper()
Á
>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

I guess this is what is happening to the reporter. This violates the
least surprising behavior principle in so many different ways that it
hurts.

Can anybody make sense of it?

----------------------------------------------------------------------

Comment By: Ahmet Bişkinler (ahmetbiskinler)
Date: 2006-08-11 10:10

Message:
Logged In: YES
user_id=1481281

What happened? Is it solved? How is it going? What is the final
step? ...? ...?
Could you please give me some information about the bug please?

----------------------------------------------------------------------