As others have suggested, it depends on the language, and it also depends on how a token is defined: can it contain dashes, underscores, or digits? That definition also determines what counts as a separator. The two main ways to do the splitting are then either to cut on the separators (specify the whitespace explicitly) or to pick out uninterrupted runs of valid token characters (specify the valid symbols explicitly).
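A minimal sketch of both approaches in Python (the sample string, the exact character class, and the variable names are just illustrative assumptions, not from the thread):

import re

text = "foo_bar, baz-2 and  qux's   thing"

# Method 1: specify the whitespace explicitly and cut on it.
by_whitespace = re.split(r"\s+", text.strip())

# Method 2: specify the valid token symbols explicitly and pull out
# uninterrupted runs of them (here: letters, digits, underscore, dash).
by_symbols = re.findall(r"[A-Za-z0-9_-]+", text)

print(by_whitespace)  # ['foo_bar,', 'baz-2', 'and', "qux's", 'thing']
print(by_symbols)     # ['foo_bar', 'baz-2', 'and', 'qux', 's', 'thing']

Note that trailing punctuation sticks to the word in the first method, while the apostrophe splits "qux's" in the second; which behaviour is right depends entirely on that token definition.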
Nick V.

Tim Chase wrote:
> >> rgx = re.compile('\W+')
> >>
> >> if you don't mind numbers included in your text (in the event you
> >> have things like "fatal1ty", "thing2", or "pdf2txt") which is
> >> often the case...they should be considered part of the word.
> >>
> >> If that's a problem, you should be able to use
> >>
> >> rgx = re.compile('[^a-zA-Z]+')
> >>
> >> This is a bit Euro-centric...
> >
> > I'd call it half-asscii :-)
>
> groan... :)
>
> Given the link you provided, I correct my statement to
> "Anglo-centric", as there are clearly oddball cases in languages
> such as French.
>
> > textbox = "He was wont to be alarmed/amused by answers that won't work"
>
> Well, one could do something like
>
> >>> s
> "He was wont to be alarmed/amused by answers that won't work"
> >>> s2
> "The two-faced liar--a real joker--can't tell the truth"
> >>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
> >>> r.findall(s), r.findall(s2)
> (['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
> 'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
> 'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])
>
> which parses your example the way I would want it to be parsed,
> and handles the strange string I came up with to try similar
> examples the way I would expect that it would be broken down by
> "words"...
>
> I had a hard time comin' up with any words I'd want to call
> "words" where the additional non-word glyph (apostrophe, dash,
> etc) wasn't 'round the middle of the word. :)
>
> Any more crazy examples? :)
>
> -tkc
--
http://mail.python.org/mailman/listinfo/python-list
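For what it's worth, the regex quoted above drops straight into a small standalone script; the words() helper name is mine, not from the thread:

import re

# The quoted pattern: a run of letters where a single apostrophe or dash
# counts as part of the word only when it has a letter on both sides.
WORD_RE = re.compile(r"(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")

def words(text):
    return WORD_RE.findall(text)

print(words("He was wont to be alarmed/amused by answers that won't work"))
print(words("The two-faced liar--a real joker--can't tell the truth"))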