Vlastimil Brom <vlastimil.b...@gmail.com> added the comment:

I just noticed a somehow strange behaviour in matching character sets or 
alternate matches which contain some more "advanced" unicode characters, if 
they are in the search pattern with some "simpler" ones. The former seem to be 
ignored and not matched (the original re engine matches all of them); (win XPh 
SP3 Czech, Python 2.7; regex issue2636-20100414)

>>> print u"".join(regex.findall(u".", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(regex.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(regex.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(re.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(re.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëēěė

even stranger, if the pattern contains only these "higher" unicode characters, 
everything works ok: 
>>> print u"".join(regex.findall(u"ē|ě|ė", u"eèéêëēěė"))
ēěė
>>> print u"".join(regex.findall(u"[ēěė]", u"eèéêëēěė"))
ēěė


The characters in question are some accented latin letters (here in ascending 
codepoints), but it can be other scripts as well.
>>> print regex.findall(u".", u"eèéêëēěė")
[u'e', u'\xe8', u'\xe9', u'\xea', u'\xeb', u'\u0113', u'\u011b', u'\u0117']

The threshold isn't obvious to me, at first I thought, the characters 
represented as unicode escapes are problematic, whereas those with hexadecimal 
escapes are ok; however ē - u'\u0113' seems ok too.
(python 3.1 behaves identically:
>>> regex.findall("[eèéêëēěė]", "eèéêëēěė")
['e', 'è', 'é', 'ê', 'ë', 'ē']
>>> regex.findall("[ēěė]", "eèéêëēěė")
['ē', 'ě', 'ė']
)

vbr

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to