Vlastimil Brom <vlastimil.b...@gmail.com> added the comment:

I just noticed somewhat strange behaviour when matching character sets or alternations that contain some more "advanced" Unicode characters together with "simpler" ones in the same search pattern. The former seem to be ignored and are not matched (the original re engine matches all of them).
(win XPh SP3 Czech, Python 2.7; regex issue2636-20100414)

>>> print u"".join(regex.findall(u".", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(regex.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(regex.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(re.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(re.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëēěė

Even stranger, if the pattern contains only these "higher" Unicode characters, everything works fine:

>>> print u"".join(regex.findall(u"ē|ě|ė", u"eèéêëēěė"))
ēěė
>>> print u"".join(regex.findall(u"[ēěė]", u"eèéêëēěė"))
ēěė

The characters in question are some accented Latin letters (here in ascending codepoints), but it can be other scripts as well.

>>> print regex.findall(u".", u"eèéêëēěė")
[u'e', u'\xe8', u'\xe9', u'\xea', u'\xeb', u'\u0113', u'\u011b', u'\u0117']

The threshold isn't obvious to me. At first I thought that the characters represented as Unicode escapes were problematic, whereas those with hexadecimal escapes were OK; however, ē (u'\u0113') seems OK too.

(Python 3.1 behaves identically:

>>> regex.findall("[eèéêëēěė]", "eèéêëēěė")
['e', 'è', 'é', 'ê', 'ë', 'ē']
>>> regex.findall("[ēěė]", "eèéêëēěė")
['ē', 'ě', 'ė']
)

vbr

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________