On Sep 29, 11:03 pm, "Mark Tolonen" <[EMAIL PROTECTED]> wrote:
> "Eric Abrahamsen" <[EMAIL PROTECTED]> wrote in message
>
> news:[EMAIL PROTECTED]
>
> > Is it possible to use the re module to find runs of characters within a
> > certain Unicode range?
>
> > I'm writing a Markdown extension to go over text and wrap blocks of
> > consecutive Chinese characters in <span class="char"></span> tags for
> > nice styling in an HTML page. The available hooks appear to be a
> > pre-processor (which is a "for line in lines" situation) or an inline pattern
> > (which uses regular expressions). The regular expression solution would
> > be much simpler and faster, but something tells me there's no way to use
> > a regex to find character ranges... Chinese characters appear to fall
> > between 19968 and 40959 using ord(), and I suppose I can go that route if
> > necessary, but I think it would be ugly.
>
> # coding: utf-8
> import re
> sample = u'My name is 马克. I am 美国人.'
> for n in re.findall(ur'[\u4e00-\u9fff]+',sample):
>     print n
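
For the wrapping step itself, the same character class can be handed to re.sub instead of re.findall. What follows is only a minimal sketch, not the Markdown extension hook: it assumes Python 3 (the snippet quoted above is Python 2, where the ur'' prefix is still valid), the names CJK_RUN and wrap_cjk are made up for illustration, and the range covers only the main CJK Unified Ideographs block (U+4E00 to U+9FFF, i.e. 19968 to 40959 via ord()), so ideographs in the extension blocks would be missed.

```python
import re

# Runs of characters in the CJK Unified Ideographs block,
# U+4E00..U+9FFF (19968..40959 when passed through ord()).
# CJK_RUN is a hypothetical name used only for this sketch.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def wrap_cjk(text):
    """Wrap each run of consecutive CJK ideographs in a span tag.

    Hypothetical helper; \\g<0> in the replacement stands for the
    whole matched run.
    """
    return CJK_RUN.sub(r'<span class="char">\g<0></span>', text)


sample = 'My name is 马克. I am 美国人.'
print(wrap_cjk(sample))
```

Because the pattern uses + rather than matching one character at a time, each block of consecutive characters gets a single span, which is what the original question asks for.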
Of course! And obvious, once you point it out. Thanks for the help.

> This sounds similar to what zhpy
> (http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to
> extract Chinese words from code, to generate executable English Python.
> You might give that a look.
> --Mark

Mark - not quite what I'm after here, but pretty interesting nonetheless...

E
--
http://mail.python.org/mailman/listinfo/python-list