John Machin wrote:
> Are you comparing BMH implemented in Python with str.count() implemented
> in C?

Tee-hee. Yes. I figured out that was the problem pretty quickly and just
went back to count().

> 1. A pure Python suggestion:
>
> Presuming there is a character SEP that cannot appear in the keys in
> your query dictionary, try this:
>
> SEP = '\n' # for example
> big_string_count = SEP.join(tc.strings).count
> for query_key in self.dict.keys():
>     self.dict[query_key]['count'] += big_string_count(query_key)
>
> Why do we have += in the above line? Is there anything else being added
> in by not-shown code? If not, the following would be much more
> comprehensible, and faster:

There is. Because the program can loop over multiple input files, it needs
to accumulate the count across the files. But that's a minor thing.

> Checking your requirements for "non-overlapping": if you have 'foobar'
> and 'barzot' in the dictionary, and a string contains 'foobarzot', your
> code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

Exactly. Let me be more precise: non-overlapping with itself. I.e., if I
have "aa", then "aaaa" should count 2. But foobar and barzot both coming
back 1 in your example is exactly the behavior I want.

> 2. Next, if you don't mind using a C-extension, look at Danny Yoo's
> ahocorasick module, which will search in parallel for your 3000+ keys.
>
> http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
>
> Under one definition of non-overlapping, you will be able to use one of
> the findall methods; under the other definition, you will need to use
> one of the search methods and restart at (matched_position + 1).

Nifty! Thanks muchly.

> 3. If you want to roll your own, start with Gonzalo Navarro's
> publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.
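
For anyone following along, here is a minimal sketch of the two counting
behaviours discussed in this thread. It is only an illustration: `strings`
and `counts` are hypothetical stand-ins for `tc.strings` and `self.dict`,
and it assumes no key ever contains the separator. `count_keys` uses
str.count(), which counts matches that do not overlap with themselves;
`count_overlapping` shows the "restart at matched_position + 1" idea using
plain str.find() rather than the ahocorasick module.

```python
SEP = "\n"  # assumption: this character never occurs inside any key


def count_keys(strings, counts):
    """Add each key's str.count() result to counts[key]['count']."""
    # Join once, then reuse the bound count method for every key.
    big_string_count = SEP.join(strings).count
    for key in counts:
        counts[key]["count"] += big_string_count(key)
    return counts


def count_overlapping(text, key):
    """Count matches of key in text, allowing a match to overlap itself."""
    n = 0
    pos = text.find(key)
    while pos != -1:
        n += 1
        # Restart one character past the start of the previous match.
        pos = text.find(key, pos + 1)
    return n


# Hypothetical data: one "file" with two strings, then a second "file".
counts = {key: {"count": 0} for key in ("foobar", "barzot", "aa")}
count_keys(["foobarzot", "aaaa"], counts)
count_keys(["aa"], counts)  # += lets the totals accumulate across files
```

With this data, 'foobar' and 'barzot' each end up with a count of 1, and
'aa' gets 2 from "aaaa" plus 1 from the second call. By contrast,
count_overlapping("aaaa", "aa") returns 3, which is the behaviour I do not
want.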