On Fri, Apr 1, 2011 at 1:55 PM, candide <candide@free.invalid> wrote:
> Back again with my study of regular expressions ;) There exists a special
> character allowing alphanumeric extraction, the special character \w (BTW,
> what the letter 'w' refers to?).

"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.

> But this feature doesn't permit to extract
> true words; by "true" I mean word composed only of _alphabetic_ letters (not
> digit nor underscore).

Are you intentionally excluding CJK ideographs (as not
"letters"/alphabetic)? And what of hyphenated terms (e.g. "re-lock")?

> So I was wondering what is the pattern to extract (or to match) _true_ words
> ? Of course, I don't restrict myself to the ascii universe so that the
> pattern [a-zA-Z]+ doesn't meet my needs.

AFAICT, there doesn't appear to be a nice way to do this in Python using
the std lib `re` module, but I'm not a regex guru.

POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode/locale-sensitive, but includes digits and the
underscore, as you've already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.

Cheers,
Chris
--
http://blog.rebertia.com
--
http://mail.python.org/mailman/listinfo/python-list
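
Follow-up note: one workaround that stays within `re` is to subtract
digits and the underscore from \w with a negated class. The sketch below
assumes Python 3, where \w is Unicode-aware for str patterns by default
(on 2.x the re.UNICODE flag would be needed); the names TRUE_WORD and
true_words are made up for illustration. It is an approximation rather
than a real \p{Alphabetic} test.

```python
import re

# Double negative: match anything that is NOT (a non-word character,
# a decimal digit, or an underscore). What remains of \w is, roughly,
# the letters.
TRUE_WORD = re.compile(r"[^\W\d_]+")

def true_words(text):
    """Return the maximal runs of letter characters found in text."""
    return TRUE_WORD.findall(text)

print(true_words("L'été 2011, naïve_user re-lock x86"))
# → ['L', 'été', 'naïve', 'user', 're', 'lock', 'x']
```

Note that this extracts runs of letters, so "re-lock" comes back as two
words and "x86" contributes "x". I believe a few numeric characters that
are not decimal digits (superscripts, for instance) can still slip
through, so filter the results with str.isalpha() if that matters.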