John Machin wrote:
> On Jan 12, 8:25 pm, Robert Kern <[EMAIL PROTECTED]> wrote:
>> The section on "String Methods"[1] in the Python documentation states that for
>> the case conversion methods like str.lower(), "For 8-bit strings, this method is
>> locale-dependent." Is there a guarantee that unicode.lower() is
>> locale-*in*dependent?
>>
>> The section on "Case Conversion" in PEP 100 suggests this, but the code itself
>> looks like it may call the C function towlower() if it is available. On OS X
>> Leopard, the manpage for towlower(3) states that it "uses the current locale"
>> though it doesn't say exactly *how* it uses it.
>>
>> This is the bug I'm trying to fix:
>>
>> http://scipy.org/scipy/numpy/ticket/643
>> http://dev.laptop.org/ticket/5559
>>
>> [1]http://docs.python.org/lib/string-methods.html
>> [2]http://www.python.org/dev/peps/pep-0100/
>
> The Unicode standard says that case mappings are language-dependent.
> It gives the example of the Turkish dotted capital letter I and
> dotless small letter i that "caused" the numpy problem. See
> http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180

That doesn't determine the behavior of unicode.lower(), I don't think. That
specifies semantics for when one is dealing with a given language in the
abstract. That doesn't specify concrete behavior with respect to a given locale
setting on a real computer. For example, my strings 'VOID', 'INT', etc. are all
English, and I want English case behavior. The language of the data and the
transformations I want to apply to the data is English even though the user may
have set the locale to something else.

> Here is what the Python 2.5.1 unicode implementation does in an
> English-language locale:
>
> >>> import unicodedata as ucd
> >>> eyes = u"Ii\u0130\u0131"
> >>> for eye in eyes:
> ...     print repr(eye), ucd.name(eye)
> ...
> u'I' LATIN CAPITAL LETTER I
> u'i' LATIN SMALL LETTER I
> u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
> u'\u0131' LATIN SMALL LETTER DOTLESS I
> >>> for eye in eyes:
> ...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(), eye.capitalize())
> ...
> u'I' u'I' u'i' u'I'
> u'i' u'I' u'i' u'I'
> u'\u0130' u'\u0130' u'i' u'\u0130'
> u'\u0131' u'I' u'\u0131' u'I'
>
> The conversions for I and i are not correct for a Turkish locale.
>
> I don't know how to repeat the above in a Turkish locale.

If you have the correct locale data in your operating system, this should be
sufficient, I believe:

$ LANG=tr_TR python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'tr_TR'
>>> 'VOID'.lower()
'vo\xfdd'
>>> 'VOID'.lower().decode('iso-8859-9')
u'vo\u0131d'
>>> u'VOID'.lower()
u'void'
>>>

> However it appears from your bug ticket that you have a much narrower
> problem (case-shifting a small known list of English words like VOID)
> and can work around it by writing your own locale-independent casing
> functions.
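
For reference, a minimal sketch of what such a locale-independent casing
function could look like, assuming only the ASCII letters A-Z need folding. The
names ascii_lower and _ASCII_LOWER_TABLE are illustrative and are not taken from
numpy or from either ticket. It is written for Python 3; on Python 2 with 8-bit
strings, string.maketrans() would build the equivalent table.

```python
import string

# Hypothetical helper: a fixed translation table mapping ASCII 'A'-'Z' to
# 'a'-'z'. It never consults the locale, so 'I' always becomes 'i'.
_ASCII_LOWER_TABLE = str.maketrans(string.ascii_uppercase,
                                   string.ascii_lowercase)


def ascii_lower(s):
    """Lowercase only the ASCII letters in s, ignoring the current locale.

    Characters outside A-Z (including non-ASCII letters) pass through
    unchanged.
    """
    return s.translate(_ASCII_LOWER_TABLE)


print(ascii_lower("VOID"), ascii_lower("INT"))
```

Because the table is fixed, ascii_lower("VOID") gives "void" whatever the
locale is set to, which is the behavior wanted for English keywords like VOID
and INT.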
> Do you still need to find out whether Python unicode
> casings are locale-dependent?

I would still like to know. There are other places where .lower() is used in
numpy, not to mention the rest of my code.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco