Re: Using re to find unicode ranges

Mark Tolonen Mon, 29 Sep 2008 08:06:13 -0700

"Eric Abrahamsen" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

Is it possible to use the re module to find runs of characters within a certain Unicode range?
I'm writing a Markdown extension to go over text and wrap blocks of consecutive Chinese characters in <span class="char"></span> tags for nice styling in an HTML page. The available hooks appear to be a pre-processor (which is a "for line in lines" situation) or an inline pattern (which uses regular expressions). The regular expression solution would be much simpler and faster, but something tells me there's no way to use a regex to find character ranges... Chinese characters appear to fall between 19968 and 40959 using ord(), and I suppose I can go that route if necessary, but I think it would be ugly.


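# coding: utf-8
# Python 2 syntax: ur'' raw unicode literals and the print statement.
import re
sample = u'My name is 马克. I am 美国人.'
# [\u4e00-\u9fff] covers code points 19968-40959, the range given in the question.
for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
    print n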

output:

马克
美国人
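
For the span-wrapping use case in the question, the same character class should also work with re.sub. What follows is only a minimal sketch, written in Python 3 syntax (an assumption; the snippet above is Python 2). The names CJK_RUN and wrap_cjk are hypothetical, and hooking this into a Markdown inline pattern is not shown.

```python
import re

# Same range as in the findall example: U+4E00..U+9FFF (19968..40959 via ord()).
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

def wrap_cjk(text):
    """Wrap each run of consecutive characters in the range in a span tag."""
    # \g<0> refers to the whole match, i.e. one run of characters in the range.
    return CJK_RUN.sub(r'<span class="char">\g<0></span>', text)

sample = 'My name is 马克. I am 美国人.'
print(wrap_cjk(sample))
```

Text with no characters in the range is returned unchanged, so the function should be safe to apply to every line.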

--Mark

--
http://mail.python.org/mailman/listinfo/python-list
