Re: Extracting "true" words

MRAB Fri, 01 Apr 2011 16:21:55 -0700

On 01/04/2011 21:55, candide wrote:

> Back again with my study of regular expressions ;) There exists a
> special character allowing alphanumeric extraction, the special
> character \w (BTW, what the letter 'w' refers to?). But this feature
> doesn't permit to extract true words; by "true" I mean word composed
> only of _alphabetic_ letters (not digit nor underscore).

The 'w' stands for a 'word' character, although in regex that means
letters, digits and the underscore character '_', because of its use in
computer languages (basically, the characters of an identifier or name).


> So I was wondering what is the pattern to extract (or to match) _true_
> words ? Of course, I don't restrict myself to the ascii universe so that
> the pattern [a-zA-Z]+ doesn't meet my needs.

Using the re module, you would have to create a character class out of
all the possible letters, something like this:

    letter_class = u"[" + u"".join(unichr(c) for c in range(0x10000) if unichr(c).isalpha()) + u"]"
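
For anyone on Python 3, the same idea can be sketched roughly as follows. This is only an illustration: the names word_re and true_words are made up for the example, and the scan stops at 0x10000 (the Basic Multilingual Plane) to mirror the line above, so letters outside that range are not covered.

```python
import re

# Collect every BMP code point that str.isalpha() accepts.
# In Python 3, chr() replaces unichr() and all strings are Unicode.
letters = "".join(chr(c) for c in range(0x10000) if chr(c).isalpha())

# No alphabetic character is special inside [...], so no escaping is needed.
letter_class = "[" + letters + "]"

# One or more consecutive letters make up a "true" word.
word_re = re.compile(letter_class + "+")


def true_words(text):
    """Return the runs of alphabetic characters found in text."""
    return word_re.findall(text)


print(true_words("caf\u00e9 au_lait 2go"))  # ['café', 'au', 'lait', 'go']
```

Building the class scans 65536 code points once, so it is best done at import time and the compiled pattern reused.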


Alternatively, you could try the new regex implementation here:

    http://pypi.python.org/pypi/regex

which adds support for Unicode properties, and do something like this:

    words = regex.findall(ur"\p{Letter}+", unicode_text)
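
To make clear what a property like \p{Letter} selects: it covers the Unicode general categories whose names begin with "L" (Lu, Ll, Lt, Lm, Lo). The sketch below is not the regex module. It is a standard-library stand-in written for Python 3 with unicodedata, the helper names is_letter and letter_runs are hypothetical, and it only illustrates the category test.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    # General categories Lu, Ll, Lt, Lm and Lo all start with "L".
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    # Group consecutive characters by whether they are letters,
    # then keep only the groups made of letters.
    return ["".join(run) for letter, run in groupby(text, key=is_letter) if letter]


print(letter_runs("abc123 d\u00e9f_gh"))  # ['abc', 'déf', 'gh']
```

This walks the text character by character in Python, so for large inputs a compiled pattern is likely to be faster.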
--
http://mail.python.org/mailman/listinfo/python-list
