Travers Naran wrote: > Here's the basic idea. I have a dictionary of substrings (the > substrings stored as keys). I have a list of strings. I want to find > out, for each word in the dictionary, how many times the substring > occurs non-overlapping occurrences there are in the list of strings. > This is the best I could come up with: > > for j in self.dict.keys(): > c = 0 > for i in tc.strings: > c += i.count(j) > self.dict[j]['count'] += c > > I tried Boyer-Moore-Horspool as someone else suggested in the group, but > it was even slower (probably because of the time spent setting up the > pattern). Even when I split the algorithm in two, doing the set-up > before the inner loop and only testing the pattern in the inner loop, it > was still slow.
Are you comparing BMH implemented in Python with str.count() implemented in C? > > So, am I doing this the optimal way or am I overlooking something > _really_ simple to do? > > Statistics on the data: > The dictionary has 3000+ entries in it. > There are several thousand strings. > The strings average 10-26 characters in length. 1. A pure Python suggestion: Presuming there is a character SEP that cannot appear in the keys in your query dictionary, try this: SEP = '\n' # for example big_string_count = SEP.join(tc.strings).count for query_key in self.dict.keys(): self.dict[query_key]['count'] += big_string_count(query_key) Why do we have += in the above line? Is there anything else being added in by not-shown code? If not, the following would be much more comprehensible, and faster: self.dict_count = {} SEP = '\n' # for example big_string_count = SEP.join(tc.strings).count for query_key in self.dict.keys(): self.dict_count[query_key] = big_string_count(query_key) Checking your requirements for "non-overlapping": if you have 'foobar' and 'barzot' in the dictionary, and a string contains 'foobarzot', your code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK? 2. Next, if you don't mind using a C-extension, look at Danny Yoo's ahocorasick module, which will search in parallel for your 3000+ keys. http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ Under one definition of non-overlapping, you will be able to use one of the findall methods; under the other definition, you will need to use one of the search methods and restart at (matched_position + 1). 3. If you want to roll your own, start with Gonzalo Navarro's publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html HTH, John -- http://mail.python.org/mailman/listinfo/python-list