On Tue, 01 Mar 2005 22:04:15 +0100, André Søreng <[EMAIL PROTECTED]> wrote:
Kent Johnson wrote:
André Søreng wrote:
Hi!
Given a string, I want to find all occurrences of certain predefined words in that string. The problem is that the list of words to detect can number in the thousands.
With the re module, this can be solved something like this:
import re
r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)
Unfortunately, with more than about 10,000 words in the regexp, I get a regular expression runtime error when trying to execute the findall function (compile works fine, but slowly).
I don't know if using the re module is the right solution here. Any suggestions on alternative solutions or data structures which could be used to solve the problem?
If you can split some_string into individual words, you could look them up in a set of known words:
known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [ word for word in some_string.split() if word in known_words ]
Kent
André
That is not exactly what I want. It should discover if some of the predefined words appear as substrings, not only as equal words. For instance, after matching "word2sgjoisejfisaword1yguyg", word2 and word1 should be detected.
Show some initiative, man!
known_words = set(["word1", "word2"])
found_words = [word for word in known_words if word in "word2sgjoisejfisaword1yguyg"]
found_words
['word1', 'word2']
Peace
Bill Mill
bill.mill at gmail.com
Yes, but I was looking for a solution which would scale. Searching through the same string 10000+++ times does not seem like a suitable solution.
André
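One approach that might fit the scaling requirement is the Aho-Corasick algorithm, which puts all the words into a trie, adds "failure links" between trie nodes, and then finds every occurrence of every word in a single left-to-right pass over the string. Nobody in the thread proposed this, and the standard library does not ship it, so what follows is only a minimal from-scratch sketch. The names build_automaton and find_words are made up for illustration.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton (trie plus failure links).

    Hypothetical helper, not part of any library.
    """
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> words that end when this state is reached

    # Phase 1: insert every distinct, non-empty word into the trie.
    for word in set(words):
        if not word:
            continue
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)

    # Phase 2: breadth-first pass to compute the failure links.
    # Depth-1 states keep fail == 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state as well.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_words(text, automaton):
    """Return (start_index, word) for every occurrence of a word in text."""
    goto, fail, out = automaton
    state = 0
    found = []
    for i, ch in enumerate(text):
        # Follow failure links until some state accepts ch (or we hit the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            found.append((i - len(word) + 1, word))
    return found


automaton = build_automaton(["word1", "word2"])
print(find_words("word2sgjoisejfisaword1yguyg", automaton))
# [(0, 'word2'), (17, 'word1')]
```

The automaton is built once, with work roughly proportional to the total length of the word list. Each search then costs about the length of the string plus the number of matches reported, rather than one scan of the string per word. Whether this beats the simple loop in practice for 10,000 words would need measuring.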