"Eric Abrahamsen" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
Is it possible to use the re module to find runs of characters within a
certain Unicode range?
I'm writing a Markdown extension to go over text and wrap blocks of
consecutive Chinese characters in <span class="char"></span> tags for
nice styling in an HTML page. The available hooks appear to be a
preprocessor (which is a "for line in lines" situation) or an inline pattern
(which uses regular expressions). The regular expression solution would
be much simpler and faster, but something tells me there's no way to use
a regex to find character ranges... Chinese characters appear to fall
between 19968 and 40959 using ord(), and I suppose I can go that route if
necessary, but I think it would be ugly.
# coding: utf-8
import re
sample = u'My name is 马克. I am 美国人.'
# \u4e00-\u9fff is the same range as ord() values 19968-40959
for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
    print n
output:
马克
美国人
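For the wrapping itself, the same character class can be handed to re.sub. What follows is only a rough sketch: wrap_chinese and CJK_RUN are names made up for illustration, and wiring the pattern into a Markdown inline-pattern hook is not shown. It uses a plain u'' literal instead of ur'' so that it should also run on Python 3, where the ur prefix is not accepted.

```python
# -*- coding: utf-8 -*-
import re

# Same range as in the findall example (U+4E00 to U+9FFF, i.e. 19968 to 40959).
# The \u escapes are resolved by the string literal itself, so the regex
# engine sees a character class containing the two literal end points.
CJK_RUN = re.compile(u'[\u4e00-\u9fff]+')

def wrap_chinese(text):
    """Wrap each run of consecutive characters in the range in a span tag."""
    # \g<0> in the replacement refers to the whole matched run.
    return CJK_RUN.sub(u'<span class="char">\\g<0></span>', text)

sample = u'My name is 马克. I am 美国人.'
wrapped = wrap_chinese(sample)
print(wrapped)
```

Text without any characters in the range comes back unchanged, so it should be safe to run over every line.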
--Mark