Re: Turkic I and re

MRAB Thu, 15 Sep 2011 07:09:16 -0700

On 15/09/2011 14:44, John-John Tedro wrote:

On Thu, Sep 15, 2011 at 1:16 PM, Alan Plum <m...@alanplum.com
<mailto:m...@alanplum.com>> wrote:


    On 2011-09-15 15:02, MRAB wrote:

        The regex module at http://pypi.python.org/pypi/regex currently
        uses a compromise, where it matches 'I' with 'i' and also 'I'
        with 'ı' and 'İ' with 'i'.

        I was wondering if it would be preferable to have a TURKIC flag
        instead ("(?T)" or "(?T:...)" in the pattern).


    I think the problem many people ignore when coming up with solutions
    like this is that while this behaviour is pretty much unique to
    Turkish script, there is no guarantee that Turkish substrings won't
    appear in other language strings (or vice versa).

    For example, foreign names in Turkish are often given as spelled in
    their native (non-Turkish) script variants. Likewise, Turkish names
    in other languages are often given as spelled in Turkish.

    The Turkish 'I' is a peculiarity that will probably haunt us
    programmers until hell freezes over. Unless Turkey abandons its
    traditional orthography or people start speaking only a single
    language at a time (including names), there's no easy way to deal
    with this.

    In other words: the only way to make use of your proposed flag is if
    you have a fully language-tagged input (e.g. an XML document making
    extensive use of xml:lang) and only ever apply regular expressions
    to substrings containing one culture at a time.
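To make the point about language-tagged input concrete, here is a minimal sketch, assuming Python 3 and input that has already been split into (language, text) pairs. The helper names turkic_lower and lower_tagged are hypothetical and are not part of re or the regex module; they only illustrate applying the Turkic rule to the substrings tagged as Turkic and the default rule everywhere else.

```python
# Hypothetical helpers for illustration; not part of re or the regex module.
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def turkic_lower(text):
    """Lowercase with the Turkic rules: İ -> i and I -> ı."""
    # Handle the two special letters first, then let str.lower() do the rest.
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


def lower_tagged(segments):
    """Lowercase (lang, text) pairs, using Turkic rules only for 'tr' and 'az'."""
    return [
        turkic_lower(text) if lang in ("tr", "az") else text.lower()
        for lang, text in segments
    ]


# Assumed input: each substring carries its own language tag.
segments = [("tr", "ISPARTA"), ("en", "IOWA"), ("tr", "İZMİR")]
result = lower_tagged(segments)
print(result)  # ['ısparta', 'iowa', 'izmir']
```

Without the tags there is nothing to tell the sketch which rule to pick for a capital 'I', which is exactly the difficulty described above.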

    --
    http://mail.python.org/mailman/listinfo/python-list


Python does not appear to support special case mappings; in effect, it
is not 100% compliant with the Unicode standard.

The locale-specific 'i' casing in Turkic is mentioned in section 5.18
(Case Mappings) of the Unicode standard:
http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf#G21180

AFAIK, the case methods of Python strings seem to be built around the
assumption that len("string") == len("string".upper()), but some of
these casing rules require that the string grow, like uppercasing the
German sharp s "ß", which should be translated to the expanded string "SS".
These special cases should be triggered by specific locales, but I have
not been able to verify that the Turkic uppercasing of "i" works on
any of Python 2.6, 2.7 or 3.1:

   import locale
   # Warning: requires the Turkish locale to be installed on your system.
   locale.setlocale(locale.LC_ALL, "tr_TR.utf8")
   ord("i".upper()) == 0x130 # is False for me, but should be True

I wouldn't be surprised if these issues carry over into the 're' module.
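For anyone reading this on a later interpreter: as far as I know, Python 3.3 and newer apply the locale-independent special casing rules, so strings can change length when their case is converted, while the Turkic rule is still not applied. A small check, assuming such a version (this is not the behaviour of the 2.6, 2.7 or 3.1 releases discussed here):

```python
# Assumes Python 3.3 or newer; older releases behave differently.
sharp_s = "\u00df"       # ß
dotted_capital = "\u0130"  # İ

upper_sharp_s = sharp_s.upper()        # expands to two characters
lower_dotted = dotted_capital.lower()  # "i" followed by U+0307 COMBINING DOT ABOVE
upper_i = "i".upper()                  # plain "I", whatever the locale is

print(len(sharp_s), len(upper_sharp_s))                 # 1 2
print([hex(ord(ch)) for ch in lower_dotted])            # ['0x69', '0x307']
print(hex(ord(upper_i)))                                # 0x49
```

So the length assumption no longer holds on those versions, but "i".upper() still never yields U+0130.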

There has been some discussion on the Python-dev list about improving
Unicode support in Python 3.

It's somewhat unlikely that Unicode will become locale-dependent in
Python because it would cause problems; you don't want:

    "i".upper() == "I"

to be maybe true, maybe false.

An option would be to specify whether it should be locale-dependent.

The only support appears to be the 'L' switch, but it only makes "\w, \W,
\b, \B, \s and \S dependent on the current locale".


That flag is for locale-dependent 8-bit encodings. The ASCII (Python
3), LOCALE and UNICODE flags are mutually exclusive.
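A quick way to see the mutual exclusion, sketched for Python 3 with the standard re module. The helper try_compile is a hypothetical name used only for this illustration, and the exact error messages may differ between versions, so only accept or reject is reported reliably.

```python
import re


def try_compile(pattern, flags):
    """Report whether re accepts this combination of pattern type and flags."""
    try:
        re.compile(pattern, flags)
    except ValueError as exc:
        return "rejected: %s" % exc
    return "accepted"


# str patterns are Unicode by default; combining ASCII with UNICODE is refused.
ascii_and_unicode = try_compile(r"\w+", re.ASCII | re.UNICODE)
ascii_only = try_compile(r"\w+", re.ASCII)

# LOCALE is meant for bytes patterns and cannot be combined with ASCII.
bytes_ascii_and_locale = try_compile(br"\w+", re.ASCII | re.LOCALE)

# With ASCII, \w no longer matches the dotless ı (U+0131).
unicode_word = re.fullmatch(r"\w", "\u0131") is not None
ascii_word = re.fullmatch(r"\w", "\u0131", re.ASCII) is not None

print(ascii_and_unicode)
print(ascii_only)
print(bytes_ascii_and_locale)
print(unicode_word, ascii_word)
```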

That probably does not apply the special rules mentioned above, but
I could be wrong. Make sure that your locale is correct and test again.

If you are unsuccessful, I don't see a 'Turkic flag' being introduced
into the re module any time soon, given the following from PEP 20:
"Special cases aren't special enough to break the rules."

That's why I'm interested in the view of Turkish users. The rest of us
will probably never have to worry about it! :-)

(There's a report in the Python bug tracker about this issue, which is
why the regex module has the compromise.)
