New submission from Vlastimil Brom <vlastimil.b...@gmail.com>:

Hi,
I just noticed a behaviour of the re.LOCALE flag that I can't understand. I first reported this against the new regex implementation, which, however, only mimics the standard library re in this case:
http://code.google.com/p/mrab-regex-hg/issues/detail?id=6

I also couldn't find anything relevant in the tracker other than some older, already fixed issues; I'm sorry if I missed something.

I thought the search pattern (?L)\w would match any of the respective string.letters according to the current locale (and possibly additionally [0-9_]).
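One way this expectation could be checked on 8-bit strings rather than on unicode strings is sketched below. This is only a sketch under assumptions, not part of the session that follows: it assumes Python 3 and assumes that LOCALE is meant to apply to byte strings (later Python 3 releases accept re.LOCALE only with bytes patterns). The helper names locale_word_bytes and locale_word_chars are hypothetical.

```python
import re


def locale_word_bytes():
    """Return the single-byte values that (?L)\\w matches under the current locale."""
    # Candidates are the 256 possible 8-bit characters, not Unicode code points.
    all_bytes = bytes(range(256))
    # A bytes pattern is used so that the locale's 8-bit character classes apply.
    return b"".join(re.findall(br"(?L)\w", all_bytes))


def locale_word_chars(encoding):
    """Decode the matched bytes with the given code page, for display only."""
    # errors="replace" shows byte values that are unassigned in the code page as U+FFFD.
    return locale_word_bytes().decode(encoding, errors="replace")
```

After a call such as locale.setlocale(locale.LC_ALL, ""), the result of locale_word_chars("windows-1250") would be the string to compare against the letters of the Czech code page. No output is shown here because it depends on the platform and the active locale.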
However, the locale doesn't seem to be reflected in an expected way.

>>> unicode_BMP = " " + "".join(unichr(i)for i in range(1, 0x10000))
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz¢²³µ¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
>>>

>>> unicode_BMP = " " + "".join(unichr(i)for i in range(1, 0x10000))
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻłµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
>>>

It seems that the nearest letter set to the result of the re/regex LOCALE flag might be ASCII or the US locale:

>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
>>>

however, there are some differences too, namely between z and À

re (?L)\w :
Czech z£¥ª¯³µ¹º¼¾¿À
Greek z¢²³µ¸¹º¼¾¿À

string.letters -- US locale
zƒŠŒŽšœžŸªµºÀ

(as displayed in tkinter Idle shell)

(in either case, there are some items one wouldn't consider usual word characters, cf.
¿)

I am not sure whether other issues are involved (like some encoding/displaying peculiarities in Tkinter), but the re matching using the LOCALE flag doesn't reflect locale.setlocale(...) in a transparent way.

Is it supposed to work this way, and is there another way to get the locale-aware matching one might expect according to:
http://docs.python.org/library/re.html#re.LOCALE
"""
Make \w, \W, \b, \B, \s and \S dependent on the current locale.
"""

Using Python 2.7.1, 32 bit; win 7 Home Premium 64-bit, Czech.

In Python 3.1.3 as well as 3.2 the result is the same (with the appropriately modified code):

...
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>>

However, in Python 3, there is no comparison with string.letters available anymore.

Regards,
Vlastimil Brom

----------
components: Regular Expressions, Unicode
messages: 132826
nosy: vbr
priority: normal
severity: normal
status: open
title: re.LOCALE doesn't reflect locale.setlocale(...)
type: behavior
versions: Python 2.7, Python 3.1, Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11744>
_______________________________________