----- Original Message -----
From: "André Søreng" <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: <python-list@python.org>
Sent: Tuesday, March 01, 2005 8:46 PM
Subject: Regular Expressions: large amount of or's

>
> Hi!
>
> Given a string, I want to find all ocurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular _expression_ runtime error when
> trying to execute the findall function (compile works fine, but slow).
>
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?
>
> André
>
Here's a function. You can copy and run it. See if that meets your requirements. It's an algorithm I once used for a search-and-replace program. That's why I collect the string indexes. It can easily be adapted. Suppose you only want to know which words occur, not where they occur. You then don't need the string indexes and don't need to return the same word repeatedly for each occurrences. If your list is 10,000 words as you say, some optimizing would be indicated, such as sorting it alphabetically and match only the subset with the initial matching the character at the current string offset.
 
Salut
 
Frederic
 
 
 
############################################
 
 
def search_words (s = None, wtf = None):
 
 
   if s == None:
      # Unless you provide your own string, it'll use this one
 
      s = 'Personal firewall software may warn about the connection IDLE\n' \
          '    makes to its subprocess using this computer\'s internal loopback.\n'
 
 
   if wtf == None:
 
      # unless you provide your own list, it'll use this one
 
      words_to_find = (
         'fire',
         'wall',
         'process',
         'subprocess',
         'war',
         'ware',
         'international',
         'inter',
         'internal',
         'internals',
         'internal combustion engine',    # You can search any substring, not just words
         ' ',         
         '   ',
         '\n',
         '\t',
      )
 
   else:

      words_to_find = wtf
 

   WORD     = 0
   POSITION = 1
 
   words = []      # Put matching words and string offsets into this list to return
 
   length_of_string = len (s)
   read_index       = 0
 
   while read_index < length_of_string:
 
      # Try to match all words in your list against the string at offset read_index
 
      words_list_index = 0              # Start matching at index 0 ('fire')
 
      all_matches = []                  # Collector for all matches at current string offset
 
      for word in words_to_find:        # Go through your entire list of words one by one
 
         if s [read_index:].startswith (word):   # If current word matches,
 
            all_matches.append ((len (pattern), words_list_index))      # collect its length and index in your list 
 
         words_list_index += 1
 
      if all_matches > []:          # If one or more matches occurred
 
         all_matches.sort ()        # Put them in order
         all_matches.reverse ()     # and the longest match on top
         words.append ((words_to_find [all_matches [0][POSITION]], read_index))   # Collect the longest matching word. Other strategies
                                                                                  # are possible. Search and replace requires this one.
 
         read_index += all_matches [0][WORD]                                      # Continue next read just past the matched section,
                                                                                  # unless you allow overlapping matches. re doesn't.
                                                                                  # Search and replace doesn't.
                                                                                 
      else:                    # No word matched at current offset
         read_index += 1       # Try one further down next
 
   return words                # [(' ', 8), ('wall', 13), (' ', 17), ('ware', 22), (' ', 26), etc.
 
############################################
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to