Re: Extracting "true" words

John Nagle Fri, 01 Apr 2011 21:08:19 -0700

On 4/1/2011 4:10 PM, Chris Rebert wrote:

On Fri, Apr 1, 2011 at 1:55 PM, candide<[email protected]>  wrote:

Back again with my study of regular expressions ;) There is a special
character for alphanumeric extraction, \w (BTW, what does the letter 'w'
stand for?).


"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.

But this feature doesn't allow extracting
true words; by "true" I mean words composed only of _alphabetic_ letters (no
digits or underscores).


Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
And what of hyphenated terms (e.g. "re-lock")?
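
For illustration, here is a minimal sketch of one common way to match
letters only with Python's re module. It assumes Python 3, where \w is
Unicode-aware for str patterns. The names LETTERS and HYPHENATED and the
sample string are made up for this example and do not come from the thread.

```python
import re

# [^\W\d_] reads "a word character that is neither a digit nor an
# underscore", i.e. a letter. CJK ideographs also count as letters here.
LETTERS = re.compile(r"[^\W\d_]+")

# Variant that keeps hyphenated terms such as "re-lock" in one piece.
HYPHENATED = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

text = "foo_bar re-lock abc123 na\u00efve 42"
print(LETTERS.findall(text))     # underscores, digits and hyphens split words
print(HYPHENATED.findall(text))  # "re-lock" survives as a single match
```

Whether ideographs and hyphenated terms should count as "true" words is
exactly the open question above, so treat both patterns as starting points.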


    It's an interesting parsing problem to find word breaks in mixed
language text.  It's quite common to find English and Japanese text
mixed.  (See "http://www.dokidoki6.com/00_index1.html".  Caution,
excessively cute.) Each ideograph is a "word", of course.

    Parse this into words:

★12/25/2009★
6%DOKIDOKI VISUAL FILE vol.4を公開しました。
アルバムの上部で再生操作、下部でサムネイルがご覧いただけます。
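
A rough sketch of what a naive split might look like, assuming a
hand-written regex over a few Unicode blocks. The ranges, the TOKEN
pattern, and the rough_tokens helper are illustrative assumptions, not a
real segmenter. It treats each ideograph as its own token, as described
above, and groups kana, ASCII letters, and ASCII digits into runs.

```python
import re

# Hypothetical token pattern; the code point ranges are a simplification.
TOKEN = re.compile(
    r"[\u4e00-\u9fff]"     # one CJK unified ideograph per token
    r"|[\u3040-\u309f]+"   # run of hiragana
    r"|[\u30a0-\u30ff]+"   # run of katakana (whole block, incl. long-vowel mark)
    r"|[A-Za-z]+"          # run of ASCII letters
    r"|[0-9]+"             # run of ASCII digits
)

def rough_tokens(text):
    """Return script-based tokens; punctuation and symbols are dropped."""
    return TOKEN.findall(text)

sample = "6%DOKIDOKI VISUAL FILE vol.4を公開しました。"
print(rough_tokens(sample))
```

The kana runs are not split at real word boundaries, and multi-character
compounds are broken into single ideographs, so this only shows where the
script changes. Finding actual Japanese word breaks generally needs a
dictionary-based morphological analyzer.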

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
