Re: best split tokens?

Tim Chase Fri, 08 Sep 2006 14:43:07 -0700

> py> import re
> py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> py> [s for s in rgx.split(astr) if s]
> ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers', 
> 'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did', 
> 'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to', 
> 'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily', 
> 'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the', 
> 'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for', 
> 'all', 'William', 'Shakespear']


This regexp could be shortened to just

        rgx = re.compile(r'\W+')

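if you don't mind numbers being included in your text (in the event 
you have things like "fatal1ty", "thing2", or "pdf2txt"), which is 
often the case...they should be considered part of the word.  Note 
that \w also matches the underscore, so "foo_bar" stays one token, 
and that \W+ splits on every punctuation character, not just the 
ones listed in your character class.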
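
To see where the two patterns agree and where they differ, here is a 
small sketch.  The helper name tokens() and the sample strings are 
made up for illustration; only the two patterns come from this thread.

```python
import re

# The explicit pattern from the quoted post and the shorter \W+ version.
rgx_explicit = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
rgx_short = re.compile(r'\W+')

def tokens(rgx, text):
    """Split text on rgx and drop the empty strings split() leaves behind."""
    return [s for s in rgx.split(text) if s]

# Hypothetical sample inputs, not from the original post.
sample = "thing2, pdf2txt (fatal1ty)!"
print(tokens(rgx_short, sample))

# Digits and underscores are word characters, so they stay in the token.
print(tokens(rgx_short, "snake_case words"))

# The explicit class has no colon or "!" in it, so those stay attached,
# while \W+ strips them.
print(tokens(rgx_explicit, "well: yes!"))
print(tokens(rgx_short, "well: yes!"))
```

The empty-string filter matters for both patterns: a separator at the 
very start or end of the text makes split() return '' there.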

If that's a problem, you should be able to use

        rgx = re.compile('[^a-zA-Z]+')

This is a bit English-centric (accented letters such as "é" are 
treated as separators)...ideally Python regexps would support 
POSIX character classes, so one could use

        rgx = re.compile('[^[:alpha:]]+')


or something like it...however, that fails on my python2.4 here.
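
One possible workaround, sketched under the assumption of a Python 3 
interpreter (where str patterns are Unicode-aware by default, unlike 
the python2.4 above): "not a letter" can be spelled as "a non-word 
character, a digit, or an underscore".  The names rgx_ascii, 
rgx_letters, and tokens() are illustrative only.

```python
import re

# ASCII-only letters, as suggested above.
rgx_ascii = re.compile(r'[^a-zA-Z]+')

# Assumption: Python 3 str pattern, so \W and \d follow Unicode rules.
# Word characters are letters, digits, and "_"; excluding the last two
# from the tokens leaves (roughly) letters only.
rgx_letters = re.compile(r'[\W\d_]+')

def tokens(rgx, text):
    """Split text on rgx and drop empty strings."""
    return [s for s in rgx.split(text) if s]

# Hypothetical sample text with accented letters and a digit.
sample = "naïve café, thing2"
print(tokens(rgx_ascii, sample))    # accented letters act as separators
print(tokens(rgx_letters, sample))  # accented letters stay in the words
```

This is only an approximation of [^[:alpha:]]: it leans on how \w is 
defined rather than on a real alphabetic class.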

-tkc


