Counting occurences of words in a list of strings

2005-05-24 Thread Travers Naran
Here's the basic idea.  I have a dictionary of substrings (the substrings 
stored as keys).  I have a list of strings.  I want to find out, for each 
word in the dictionary, how many times the substring occurs non-overlapping 
occurrences there are in the list of strings.  This is the best I could 
come up with:

 for j in self.dict.keys():
 c = 0
 for i in tc.strings:
 c += i.count(j)
 self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it 
was even slower (probably because of the time spent setting up the 
pattern).  Even when I split the algorithm in two, doing the set-up before 
the inner loop and only testing the pattern in the inner loop, it was still 
slow.

So, am I doing this the optimal way or am I overlooking something _really_ 
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Counting occurences of words in a list of strings

2005-05-24 Thread Travers Naran
John Machin wrote:
> Are you comparing BMH implemented in Python with str.count() implemented 
> in C?

Tee-hee.  Yes.  I figured out that was the problem pretty quickly and just 
went back to count().

> 1. A pure Python suggestion:
> 
> Presuming there is a character SEP that cannot appear in the keys in 
> your query dictionary, try this:
> 
> SEP = '\n' # for example
> big_string_count = SEP.join(tc.strings).count
> for query_key in self.dict.keys():
> self.dict[query_key]['count'] += big_string_count(query_key)
> 
> Why do we have += in the above line? Is there anything else being added 
> in by not-shown code? If not, the following would be much more 
> comprehensible, and faster:

There is.  Because the program can loop over mutliple input files, it needs 
to count it across the files.  But that's a minor thing.

> Checking your requirements for "non-overlapping": if you have 'foobar' 
> and 'barzot' in the dictionary, and a string contains 'foobarzot', your 
> code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

Exactly.  Let me be more precise: non-overlapping with itself.  I.e., if I 
has "aa", then "" should count 2.  But foorbar and barzot both coming 
back 1 in your example is exactly the behavoir I want.

> 2. Next, if you don't mind using a C-extension, look at Danny Yoo's 
> ahocorasick module, which will search in parallel for your 3000+ keys.
> 
> http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
> 
> Under one definition of non-overlapping, you will be able to use one of 
> the findall methods; under the other definition, you will need to use 
> one of the search methods and restart at (matched_position + 1).

Nifty!  Thanks muchly.

> 3. If you want to roll your own, start with Gonzalo Navarro's 
> publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome.  If ahocorasick does the job, or even 
count, I'm OK with that.
-- 
http://mail.python.org/mailman/listinfo/python-list