Re: Using re to find unicode ranges

Eric Abrahamsen Mon, 29 Sep 2008 20:50:39 -0700

On Sep 29, 11:03 pm, "Mark Tolonen" <[EMAIL PROTECTED]> wrote:
> "Eric Abrahamsen" <[EMAIL PROTECTED]> wrote in message
>
> news:[EMAIL PROTECTED]
>
> > Is it possible to use the re module to find runs of characters within a
> > certain Unicode range?
>
> > I'm writing a Markdown extension to go over text and wrap blocks of
> > consecutive Chinese characters in <span class="char"></span> tags for
> > nice styling in an HTML page. The available hooks appear to be a
> > pre-processor (which is a "for line in lines" situation) or an inline
> > pattern (which uses regular expressions). The regular expression solution
> > would be much simpler and faster, but something tells me there's no way
> > to use a regex to find character ranges... Chinese characters appear to
> > fall between 19968 and 40959 using ord(), and I suppose I can go that
> > route if necessary, but I think it would be ugly.
>
> # coding: utf-8
> import re
> sample = u'My name is 马克. I am 美国人.'
> for n in re.findall(ur'[\u4e00-\u9fff]+',sample):
>     print n


Of course! And obvious, once you point it out. Thanks for the help.
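
A minimal sketch of how that character class might be applied to the
span-wrapping case, assuming Python 3 (where the u'' and ur'' prefixes are
no longer needed and re understands \uXXXX escapes in the pattern itself).
The names CJK_RUN and wrap_chinese are hypothetical, chosen only for
illustration, and nothing here touches the Markdown extension API. Note
that 19968 and 40959 are simply 0x4e00 and 0x9fff in decimal.

```python
import re

# Same range as in the example above: U+4E00..U+9FFF (19968..40959 via ord()).
# CJK_RUN is a hypothetical name; it matches one or more consecutive characters.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def wrap_chinese(text):
    """Wrap each run of consecutive Chinese characters in a span tag.

    Hypothetical helper for illustration, not part of any Markdown hook.
    """
    # \g<0> refers to the whole match, i.e. the run of characters found.
    return CJK_RUN.sub(r'<span class="char">\g<0></span>', text)


sample = 'My name is 马克. I am 美国人.'
wrapped = wrap_chinese(sample)
print(wrapped)
```

The same pattern string could presumably be handed to an inline pattern
hook, but how that hook expects its regex is not covered here.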



> This sounds similar to what zhpy
> (http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
> Chinese words from code, to generate executable English Python.  You might
> give that a look.
> --Mark

Mark - not quite what I'm after here, but pretty interesting
nonetheless...

E
--
http://mail.python.org/mailman/listinfo/python-list
