Re: Regular Expressions: large amount of or's

Anthra Norell Tue, 01 Mar 2005 15:28:05 -0800

----- Original Message -----

From: "André Søreng" <[EMAIL PROTECTED]>

Newsgroups: comp.lang.python

Sent: Tuesday, March 01, 2005 8:46 PM

Subject: Regular Expressions: large amount of or's

>
> Hi!
>
> Given a string, I want to find all ocurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular _expression_ runtime error when
> trying to execute the findall function (compile works fine, but slow).
>
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?
>
> André
>

Here's a function. You can copy and run it. See if that meets your requirements. It's an algorithm I once used for a search-and-replace program. That's why I collect the string indexes. It can easily be adapted. Suppose you only want to know which words occur, not where they occur. You then don't need the string indexes and don't need to return the same word repeatedly for each occurrences. If your list is 10,000 words as you say, some optimizing would be indicated, such as sorting it alphabetically and match only the subset with the initial matching the character at the current string offset.

Salut

Frederic

############################################

def search_words (s = None, wtf = None):

if s == None:

# Unless you provide your own string, it'll use this one

s = 'Personal firewall software may warn about the connection IDLE\n' \
' makes to its subprocess using this computer\'s internal loopback.\n'

if wtf == None:

# unless you provide your own list, it'll use this one

      words_to_find = (
         'fire',
         'wall',
         'process',
         'subprocess',
         'war',
         'ware',
         'international',
         'inter',
         'internal',
         'internals',
         'internal combustion engine',    # You can search any substring, not just words
         ' ',

         '   ',
         '\n',
         '\t',
      )

else:

words_to_find = wtf

WORD = 0
POSITION = 1

words = [] # Put matching words and string offsets into this list to return

length_of_string = len (s)
read_index = 0

while read_index < length_of_string:

# Try to match all words in your list against the string at offset read_index

words_list_index = 0 # Start matching at index 0 ('fire')

all_matches = [] # Collector for all matches at current string offset

for word in words_to_find: # Go through your entire list of words one by one

if s [read_index:].startswith (word): # If current word matches,

all_matches.append ((len (pattern), words_list_index)) # collect its length and index in your list

words_list_index += 1

if all_matches > []: # If one or more matches occurred

all_matches.sort () # Put them in order

all_matches.reverse () # and the longest match on top

words.append ((words_to_find [all_matches [0][POSITION]], read_index)) # Collect the longest matching word. Other strategies

# are possible. Search and replace requires this one.

         read_index += all_matches [0][WORD]                                      # Continue next read just past the matched section,
                                                                                  # unless you allow overlapping matches. re doesn't.
                                                                                  # Search and replace doesn't.

      else:                    # No word matched at current offset

read_index += 1 # Try one further down next

return words # [(' ', 8), ('wall', 13), (' ', 17), ('ware', 22), (' ', 26), etc.

############################################

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regular Expressions: large amount of or's

Reply via email to