On Nov 24, 4:44 am, Licheng Fang <[EMAIL PROTECTED]> wrote:

> Yes, millions. In my natural language processing tasks, I almost
> always need to define patterns, identify their occurrences in a huge
> data, and count them. Say, I have a big text file, consisting of
> millions of words, and I want to count the frequency of trigrams:
I have some experience with this, helping my wife do computational
linguistics. (I also have quite a lot of experience with similar things
in my day job, which is a decentralized storage grid written in Python.)

Unfortunately, Python is not a perfect tool for the job because, as
you've learned, Python isn't overly concerned about conserving memory.
Each object has substantial overhead associated with it (including each
integer, each string, each tuple, ...), and dicts add overhead due to
being sparsely filled. You should do measurements yourself to get
results for your local CPU and OS, but I found, for example, that
storing 20-byte keys and 8-byte values in a Python dict of Python
strings took about 100 bytes per entry.

Try "tokenizing" your trigrams by defining a dict from three unigrams
to a sequentially allocated integer "trigram id" (also called a
"trigram token"), and a reverse dict which goes from a trigram id back
to the three unigrams. Whenever you create a new set of three Python
objects representing unigrams, you can pass them through the first
mapping to get the trigram id and then free the original three Python
objects. If you do this multiple times, you get multiple references to
the same integer object for the trigram id. My wife and I tried this,
but it still wasn't compact enough to process her datasets in a mere
4 GiB of RAM.

One tool that might help is PyJudy:

http://www.dalkescientific.com/Python/PyJudy.html

Judy is a delightfully memory-efficient, fast, and flexible data
structure. In the specific example of trigram counting (which is also
what my wife was doing), you can, for example, assign each unigram an
integer, and assuming that you have fewer than two million unigrams you
can pack three unigrams into a 64-bit integer...

Hm, actually at this point my wife and I stopped using Python and
rewrote it in C using JudyTrees. (At the time, PyJudy didn't exist.)
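The per-entry overhead measurement suggested above can be reproduced
roughly like this; sys.getsizeof only reports shallow sizes, so the
dict, keys, and values are summed separately. This is an illustrative
estimate only (exact numbers depend on your Python build and platform),
and the variable names are mine, not from the post:

```python
import sys

# Build a dict of N entries with 20-byte keys and 8-byte values.
N = 100_000
d = {("%020d" % i).encode(): ("%08d" % i).encode() for i in range(N)}

# Shallow size of the dict plus shallow sizes of all keys and values.
total = sys.getsizeof(d) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items()
)
per_entry = total / N

# per_entry will be well above the 28 bytes of raw payload (20 + 8),
# illustrating the per-object and hash-table overhead.
print("approx bytes per entry:", per_entry)
```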
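The trigram-tokenizing scheme described above can be sketched as
follows. The names (intern_trigram, trigram_ids, trigram_of_id) are
illustrative, not from the post; the point is that repeated trigrams
all map to the same small integer id:

```python
trigram_ids = {}    # (u1, u2, u3) -> sequentially allocated trigram id
trigram_of_id = {}  # trigram id -> (u1, u2, u3), the reverse dict
counts = {}         # trigram id -> frequency

def intern_trigram(u1, u2, u3):
    """Map three unigrams to an integer trigram id, allocating sequentially."""
    key = (u1, u2, u3)
    tid = trigram_ids.get(key)
    if tid is None:
        tid = len(trigram_ids)
        trigram_ids[key] = tid
        trigram_of_id[tid] = key
    return tid

def count_trigrams(words):
    """Count trigram frequencies by id rather than by tuple of strings."""
    for i in range(len(words) - 2):
        tid = intern_trigram(words[i], words[i + 1], words[i + 2])
        counts[tid] = counts.get(tid, 0) + 1

words = "the cat sat on the cat sat on the mat".split()
count_trigrams(words)
# ("the", "cat", "sat") appears twice in this sample, under a single id.
```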
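The bit-packing trick mentioned above can be sketched like this: with
fewer than 2**21 (about two million) distinct unigrams, each unigram id
fits in 21 bits, so three of them fit in 63 bits of a 64-bit word. The
function names are illustrative, not from the post:

```python
UNIGRAM_BITS = 21
MASK = (1 << UNIGRAM_BITS) - 1  # low 21 bits set

def pack_trigram(a, b, c):
    """Pack three unigram ids (each < 2**21) into one 63-bit integer."""
    assert a <= MASK and b <= MASK and c <= MASK
    return (a << (2 * UNIGRAM_BITS)) | (b << UNIGRAM_BITS) | c

def unpack_trigram(packed):
    """Recover the three unigram ids from a packed trigram key."""
    return (
        (packed >> (2 * UNIGRAM_BITS)) & MASK,
        (packed >> UNIGRAM_BITS) & MASK,
        packed & MASK,
    )

# The packed integer serves as a compact key for counting, e.g. in a
# Judy array (JudyL maps word-sized integers to word-sized values).
counts = {}
key = pack_trigram(12, 34567, 890)
counts[key] = counts.get(key, 0) + 1
```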
If you are interested, please e-mail my wife, Amber Wilcox-O'Hearn, and
perhaps she'll share the resulting C code with you.

Regards,

Zooko Wilcox-O'Hearn
--
http://mail.python.org/mailman/listinfo/python-list