Re: Python and Cyrillic characters in regular expression

Fredrik Lundh Fri, 05 Sep 2008 10:46:45 -0700

phasma wrote:

string = u"Привет"
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)


string = u"Hi.Привет"
(u'Hi',)

the [\w\s] pattern you used matches letters, numbers, underscore, and whitespace. "." doesn't fall into that category, so the "match" method stops when it gets to that character.
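a minimal sketch of that behaviour, written for Python 3 (where str patterns are Unicode-aware by default). the original pattern isn't quoted in this message, so the `pattern` below is an assumed stand-in built around the same [\w\s] class, and the variable names are made up for illustration:

```python
import re

# Assumed stand-in for the poster's pattern; the real one is not shown here.
pattern = re.compile(r"([\w\s]+)", re.UNICODE)

text = "Hi.Привет"

# match() anchors at the start of the string and consumes characters
# only for as long as they belong to the [\w\s] class.
m = pattern.match(text)

first_run = m.group(1)   # the part matched before the "."
stopped_at = m.end()     # index where matching stopped

print(repr(first_run), stopped_at, repr(text[stopped_at]))
```

the Cyrillic letters after the dot are fine as far as \w is concerned; the match simply never gets that far.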


maybe you could use re.sub or re.findall?

>>> import re
>>> # replace all non-alphanumerics with the empty string
>>> re.sub(r"(?u)\W+", "", string)
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

>>> # find runs of alphanumeric characters
>>> re.findall(r"(?u)\w+", string)
[u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']
>>> "".join(re.findall(r"(?u)\w+", string))
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

(the "sub" example expects you to specify what characters you want to skip, while "findall" expects you to specify what you want to keep.)
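to make the skip/keep distinction concrete, here is a rough Python 3 sketch in which both approaches also preserve whitespace. the sample sentence and the variable names are invented for illustration, and the character classes are just one possible choice:

```python
import re

# Made-up sample text with punctuation to strip and spaces to preserve.
text = "Hi. Привет, мир!"

# "skip" approach: describe what to throw away,
# here anything that is neither a word character nor whitespace.
skipped = re.sub(r"[^\w\s]+", "", text)

# "keep" approach: describe what to retain,
# here runs of word characters and whitespace, joined back together.
runs = re.findall(r"[\w\s]+", text)
kept = "".join(runs)

print(repr(skipped))
print(runs)
```

the two character classes are complements of each other, so both routes end up with the same string; which one reads better depends on whether the "skip" set or the "keep" set is easier to spell out.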


</F>

--
http://mail.python.org/mailman/listinfo/python-list
