> py> import re
> py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> py> [s for s in rgx.split(astr) if s]
> ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
> 'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
> 'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
> 'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
> 'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
> 'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
> 'all', 'William', 'Shakespear']

This regexp could be shortened to just

    rgx = re.compile(r'\W+')

if you don't mind numbers being included in your text (in the event you have things like "fatal1ty", "thing2", or "pdf2txt"). That is often what you want anyway, since they should be considered part of the word. Note that \W also treats the underscore as a word character. If that's a problem, you should be able to use

    rgx = re.compile('[^a-zA-Z]+')

This is a bit Euro-centric, since it only recognizes unaccented ASCII letters. Ideally Python regexps would support POSIX character classes, so one could use

    rgx = re.compile('[^[:alpha:]]+')

or something of the like. However, that fails on my python2.4 here.

-tkc
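
P.S. A minimal sketch comparing the patterns discussed above, in case it helps to see the difference side by side. The sample strings and the split_words helper are made up for illustration (the original astr is not reproduced here). The last pattern, [\W\d_]+, is not from the thread: it is a common approximation of "split on anything that is not a letter" and assumes Python 3, where str patterns are Unicode-aware by default.

```python
import re

# Hypothetical sample text; not the original `astr` from the thread.
astr = "Four score and seven years ago; fatal1ty ran pdf2txt on thing2 (twice)."

def split_words(pattern, text):
    """Split text on the given pattern and drop empty strings."""
    return [s for s in re.compile(pattern).split(text) if s]

# \W+ keeps letters, digits and underscores together as "words".
with_digits = split_words(r'\W+', astr)

# [^a-zA-Z]+ splits on anything that is not an unaccented ASCII letter,
# so digits inside a word break it apart.
ascii_only = split_words(r'[^a-zA-Z]+', astr)

# Assumption (Python 3 only): [\W\d_]+ splits on non-word characters,
# digits and underscores, which leaves accented letters intact.
unicode_letters = split_words(r'[\W\d_]+', "naïve café thing2")

print(with_digits)
print(ascii_only)
print(unicode_letters)
```

With the first pattern "fatal1ty" survives as one token; with the second it is cut into "fatal" and "ty", which may or may not be what you want.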