Re: Regular Expressions: large amount of or's

André Søreng Wed, 02 Mar 2005 01:35:05 -0800

Bill Mill wrote:

On Tue, 01 Mar 2005 22:04:15 +0100, André Søreng <[EMAIL PROTECTED]> wrote:

Kent Johnson wrote:

André Søreng wrote:

Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. The problem is that the list of
words to detect can number in the thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)
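
As a rough sketch of how such a pattern might be assembled from a word list (the names words, pattern and found, and the sample data, are made up for illustration), each word can be passed through re.escape so that any regex metacharacters in it are matched literally:

```python
import re

# Hypothetical word list; the real one would hold thousands of entries.
words = ["word1", "word2", "word3"]

# Escape each word so metacharacters are matched literally, and put
# longer words first so they are tried before their own prefixes.
pattern = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
r = re.compile(pattern)

some_string = "word2sgjoisejfisaword1yguyg"
found = r.findall(some_string)  # non-overlapping matches, left to right
```

Note that findall only reports non-overlapping matches, so a word that starts inside another match would be missed.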

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
executing the findall function (compile works fine, but is slow).

I don't know if using the re module is the right solution here. Any
suggestions on alternative solutions or data structures which could
be used to solve the problem?

If you can split some_string into individual words, you could look them
up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [word for word in some_string.split() if word in known_words]

Kent

André


That is not exactly what I want. It should discover whether any of
the predefined words appear as substrings, not only as whole
words. For instance, when matching against "word2sgjoisejfisaword1yguyg",
both word2 and word1 should be detected.

Show some initiative, man!

known_words = set(["word1", "word2"])
found_words = [word for word in known_words if word in "word2sgjoisejfisaword1yguyg"]

found_words

['word1', 'word2']

Peace
Bill Mill
bill.mill at gmail.com

Yes, but I was looking for a solution which would scale. Searching through the same string 10,000+ times does not seem like a suitable solution.
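
One direction that might scale better is to put the words in a trie and walk the string once per starting position, so the cost depends on the string length and the longest word rather than on the number of words. The following is only a sketch under that assumption, not a tested solution from this thread; build_trie and find_words are hypothetical names. The Aho-Corasick algorithm refines the same idea so that the scan never has to restart.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def find_words(text, trie):
    """Return (start_index, word) for every occurrence, overlaps included."""
    hits = []
    for start in range(len(text)):
        node = trie
        i = start
        # Follow the trie for as long as the text keeps matching.
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if None in node:
                hits.append((start, node[None]))
    return hits


text = "word2sgjoisejfisaword1yguyg"
hits = find_words(text, build_trie(["word1", "word2"]))
```

Each starting position costs at most as many steps as the longest word, however many words the trie holds.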

André
--
http://mail.python.org/mailman/listinfo/python-list
