On Tue, 01 Mar 2005 22:04:15 +0100, André Søreng <[EMAIL PROTECTED]> wrote:
> Kent Johnson wrote:
> > André Søreng wrote:
> >
> >> Hi!
> >>
> >> Given a string, I want to find all occurrences of
> >> certain predefined words in that string. Problem is, the list of
> >> words that should be detected can be in the order of thousands.
> >>
> >> With the re module, this can be solved something like this:
> >>
> >> import re
> >>
> >> r = re.compile("word1|word2|word3|.......|wordN")
> >> r.findall(some_string)
> >>
> >> Unfortunately, when having more than about 10 000 words in
> >> the regexp, I get a regular expression runtime error when
> >> trying to execute the findall function (compile works fine, but slow).
> >>
> >> I don't know if using the re module is the right solution here, any
> >> suggestions on alternative solutions or data structures which could
> >> be used to solve the problem?
> >
> > If you can split some_string into individual words, you could look them
> > up in a set of known words:
> >
> > known_words = set("word1 word2 word3 ....... wordN".split())
> > found_words = [ word for word in some_string.split() if word in
> > known_words ]
> >
> > Kent
> >
> >>
> >> André
> >>
>
> That is not exactly what I want. It should discover if some of
> the predefined words appear as substrings, not only as equal
> words. For instance, after matching "word2sgjoisejfisaword1yguyg", word2
> and word1 should be detected.

Show some initiative, man!

>>> known_words = set(["word1", "word2"])
>>> found_words = [word for word in known_words if word in "word2sgjoisejfisaword1yguyg"]
>>> found_words
['word1', 'word2']

Peace
Bill Mill
bill.mill at gmail.com
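
The one-liner above only says which known words occur somewhere in the string. Since the original question asked for all occurrences, here is a minimal sketch of the same substring test extended to report every position. The function name find_known_words and the (start, word) result format are illustrative assumptions, not something from the thread, and with thousands of words this still makes one pass over the string per word.

```python
def find_known_words(text, known_words):
    """Return sorted (start_index, word) pairs for every occurrence
    of each known word inside text (hypothetical helper)."""
    hits = []
    for word in known_words:
        if not word:
            continue  # an empty string would "match" at every position
        start = text.find(word)
        while start != -1:
            hits.append((start, word))
            # Resume one character later so overlapping matches are kept.
            start = text.find(word, start + 1)
    hits.sort()
    return hits


if __name__ == "__main__":
    print(find_known_words("word2sgjoisejfisaword1yguyg", {"word1", "word2"}))
```

Sorting the hits gives them in order of appearance, which does not depend on the set's iteration order.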