Bill Mill wrote:
On Tue, 01 Mar 2005 22:04:15 +0100, André Søreng <[EMAIL PROTECTED]> wrote:

Kent Johnson wrote:

André Søreng wrote:


Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, but slowly).

I don't know if using the re module is the right solution here. Any
suggestions on alternative solutions or data structures which could
be used to solve the problem?


If you can split some_string into individual words, you could look them
up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [word for word in some_string.split() if word in known_words]

Kent


André


That is not exactly what I want. It should discover if some of the predefined words appear as substrings, not only as equal words. For instance, after matching "word2sgjoisejfisaword1yguyg", word2 and word1 should be detected.


Show some initiative, man!


known_words = set(["word1", "word2"])
found_words = [word for word in known_words if word in "word2sgjoisejfisaword1yguyg"]

found_words

['word1', 'word2']

Peace
Bill Mill
bill.mill at gmail.com

Yes, but I was looking for a solution which would scale. Searching through the same string 10000+++ times does not seem like a suitable solution.


André
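
One way to avoid rescanning the string once per word is a multi-pattern
matcher such as Aho-Corasick, which walks the text a single time. Nobody
in this thread proposed it, so treat the following as a minimal sketch
under that assumption, not a tuned implementation. The function names
build_automaton and find_words are made up for illustration.

```python
from collections import deque

def build_automaton(words):
    """Build an Aho-Corasick automaton (trie plus failure links)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> words that end at this state

    # Phase 1: insert every non-empty word into the trie.
    for word in words:
        if not word:
            continue
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        if word not in out[state]:
            out[state].append(word)

    # Phase 2: breadth-first pass to compute failure links.
    # Depth-1 states keep fail == 0, so start the queue with them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the suffix state as well.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_words(text, automaton):
    """Return (start_index, word) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    found = []
    for i, ch in enumerate(text):
        # Follow failure links until some state accepts this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            found.append((i - len(word) + 1, word))
    return found

automaton = build_automaton(["word1", "word2"])
matches = find_words("word2sgjoisejfisaword1yguyg", automaton)
print(matches)  # [(0, 'word2'), (17, 'word1')]
```

The automaton is built once from the word list and can then be reused for
any number of input strings. The scan does work roughly proportional to
the length of the text plus the number of matches reported, rather than
to the text length times the number of words.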
--
http://mail.python.org/mailman/listinfo/python-list