----- Original Message -----
From: "André Søreng" <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: <python-list@python.org>
Sent: Tuesday, March 01, 2005 8:46 PM
Subject: Regular Expressions: large amount of or's
>
> Hi!
>
> Given a string, I want to find all ocurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular _expression_ runtime error when
> trying to execute the findall function (compile works fine, but slow).
>
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?
>
> André
>
> Hi!
>
> Given a string, I want to find all ocurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular _expression_ runtime error when
> trying to execute the findall function (compile works fine, but slow).
>
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?
>
> André
>
Here's a function. You can copy and run it. See if that meets your
requirements. It's an algorithm I once used for a search-and-replace program.
That's why I collect the string indexes. It can easily be adapted. Suppose you
only want to know which words occur, not where they occur. You then don't need
the string indexes and don't need to return the same word repeatedly for each
occurrences. If your list is 10,000 words as you say, some optimizing would be
indicated, such as sorting it alphabetically and match only the subset with the
initial matching the character at the current string offset.
Salut
Frederic
############################################
def search_words (s = None, wtf = None):
if s == None:
# Unless you provide your
own string, it'll use this one
s = 'Personal firewall
software may warn about the connection IDLE\n'
\
' makes to its subprocess using this computer\'s internal loopback.\n'
' makes to its subprocess using this computer\'s internal loopback.\n'
if wtf == None:
# unless you provide your
own list, it'll use this one
words_to_find =
(
'fire',
'wall',
'process',
'subprocess',
'war',
'ware',
'international',
'inter',
'internal',
'internals',
'internal combustion engine', # You can search any substring, not just words
' ',
'fire',
'wall',
'process',
'subprocess',
'war',
'ware',
'international',
'inter',
'internal',
'internals',
'internal combustion engine', # You can search any substring, not just words
' ',
' ',
'\n',
'\t',
)
'\n',
'\t',
)
else:
words_to_find = wtf
WORD = 0
POSITION = 1
words = [] #
Put matching words and string offsets into this list to return
length_of_string = len (s)
read_index = 0
read_index = 0
while read_index <
length_of_string:
# Try to match all words
in your list against the string at offset read_index
words_list_index =
0
# Start matching at index 0 ('fire')
all_matches =
[]
# Collector for all matches at current string offset
for word in
words_to_find: # Go through your
entire list of words one by one
if s
[read_index:].startswith (word): # If current word
matches,
all_matches.append ((len (pattern),
words_list_index)) # collect its length and
index in your list
words_list_index += 1
if all_matches >
[]: # If one or more
matches occurred
all_matches.sort () # Put them in
order
all_matches.reverse () # and the longest match on
top
words.append ((words_to_find [all_matches [0][POSITION]],
read_index)) # Collect the longest matching word. Other strategies
# are possible. Search and replace requires this one.
read_index += all_matches
[0][WORD]
# Continue next read just past the matched
section,
# unless you allow overlapping matches. re doesn't.
# Search and replace doesn't.
else: # No word matched at current offset
# unless you allow overlapping matches. re doesn't.
# Search and replace doesn't.
else: # No word matched at current offset
read_index += 1 # Try one further down
next
return
words
# [(' ', 8), ('wall', 13), (' ', 17), ('ware', 22), (' ', 26),
etc.
############################################
-- http://mail.python.org/mailman/listinfo/python-list