>> rgx = re.compile('\W+')
>>
>> if you don't mind numbers included in your text (in the event you
>> have things like "fatal1ty", "thing2", or "pdf2txt") which is
>> often the case...they should be considered part of the word.
>>
>> If that's a problem, you should be able to use
>>
>> rgx = re.compile('[^a-zA-Z]+')
>>
>> This is a bit Euro-centric...
>
> I'd call it half-asscii :-)

groan... :)  Given the link you provided, I correct my statement
to "Anglo-centric", as there are clearly oddball cases in
languages such as French.

> textbox = "He was wont to be alarmed/amused by answers that won't work"

Well, one could do something like

>>> import re
>>> s = "He was wont to be alarmed/amused by answers that won't work"
>>> s2 = "The two-faced liar--a real joker--can't tell the truth"
>>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])

which parses your example the way I would want it to be parsed,
and breaks the strange string I came up with to test similar
cases into "words" the way I would expect.

I had a hard time comin' up with any words I'd want to call
"words" where the additional non-word glyph (apostrophe, dash,
etc) wasn't 'round the middle of the word. :)

Any more crazy examples? :)

-tkc
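
For anyone who wants to poke at the patterns above, here is a minimal
sketch rather than a definitive tokenizer. The names WORD_RE and
find_words are made up for illustration and do not come from the thread.
It shows what happens to apostrophes at the edge of a word, what the
ASCII-only character class does to an accented letter, and how the two
split-based patterns quoted earlier treat digits.

```python
import re

# Same pattern as in the message above: a run of ASCII letters in which a
# hyphen or apostrophe is accepted only when a letter sits on both sides.
WORD_RE = re.compile(r"(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")


def find_words(text):
    """Return the 'words' found in text (hypothetical helper name)."""
    return WORD_RE.findall(text)


# An apostrophe at the start or end of a word is dropped, not kept.
print(find_words("comin' 'round the middle"))
# ['comin', 'round', 'the', 'middle']

# [a-zA-Z] is ASCII-only, so an accented letter ends the word early.
print(find_words("caf\u00e9 au lait"))
# ['caf', 'au', 'lait']

# The two split-based patterns from the quoted text, for comparison.
sample = "fatal1ty uses pdf2txt"
print(re.split(r"\W+", sample))         # ['fatal1ty', 'uses', 'pdf2txt']
print(re.split(r"[^a-zA-Z]+", sample))  # ['fatal', 'ty', 'uses', 'pdf', 'txt']
```

If accented letters matter, one possible direction (an assumption on my
part, not something tested in this thread) is to build the character
class from \w minus digits and underscore instead of [a-zA-Z].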