Kent Johnson <[EMAIL PROTECTED]> wrote:

> Given a string, I want to find all occurrences of certain predefined
> words in that string. Problem is, the list of words that should be
> detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
The internal data structure that encodes that set of keywords is probably
humongous.  An alternative approach to this problem is to tokenize your
string into words, and then check to see if each word is in a defined set
of "keywords".  This works if your keywords are single words:

###
import re

keywords = set([word1, word2, ...])
matchingWords = set(re.findall(r'\w+', some_string)).intersection(keywords)
###

Would this approach work for you?

Otherwise, you may want to look at a specialized data structure for doing
multiple keyword matching; I had an older module that wrapped around a
suffix tree:

    http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

It looks like other folks, thankfully, have written other implementations
of suffix trees:

    http://cs.haifa.ac.il/~shlomo/suffix_tree/

Another approach is something called the Aho-Corasick algorithm:

    http://portal.acm.org/citation.cfm?doid=360825.360855

though I haven't been able to find a nice Python module for this yet.

Best of wishes to you!
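
P.S.  If you do end up needing Aho-Corasick and can't find a ready-made
module, it isn't too bad to sketch in pure Python.  The class below is
just my own rough sketch (the class and method names are made up, and it
isn't tuned for speed), but it shows the idea: build a trie of the
keywords, add "failure" links with a breadth-first pass, and then scan
the text once:

###
from collections import deque

class AhoCorasick:
    """Rough sketch of Aho-Corasick multiple-keyword matching.
    Names here are my own invention, not from any existing module."""

    def __init__(self, keywords):
        self.goto = [{}]      # per-node transition table: char -> node index
        self.fail = [0]       # per-node failure link
        self.output = [[]]    # keywords that end at each node
        for word in keywords:
            self._add(word)
        self._build_failure_links()

    def _add(self, word):
        # Insert one keyword into the trie.
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.output[node].append(word)

    def _build_failure_links(self):
        # Breadth-first pass: a node's failure link points to the longest
        # proper suffix of its path that is also a prefix in the trie.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Also report keywords that end at the failure node.
                self.output[child] += self.output[self.fail[child]]

    def findall(self, text):
        # Scan the text once; return (start_index, keyword) pairs.
        matches = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.output[node]:
                matches.append((i - len(word) + 1, word))
        return matches

if __name__ == '__main__':
    ac = AhoCorasick(["he", "she", "his", "hers"])
    print(ac.findall("ushers"))   # [(1, 'she'), (2, 'he'), (2, 'hers')]
###

The nice property is that the scan is a single pass over the text no
matter how many thousands of keywords you have, which is exactly the
situation you described.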