On Sat, Nov 9, 2013 at 7:14 PM, <wxjmfa...@gmail.com> wrote:
> If you wish to count the frequency of chars in a text
> and store the results in a dict, {char: number_of_that_char, ...},
> do not forget to save the key in utf-XXX, it saves memory.
Oh, if you're that concerned about the memory usage of individual
characters, try storing them as integers:

>>> sys.getsizeof("a")
26
>>> sys.getsizeof("a".encode("utf-32"))
25
>>> sys.getsizeof("a".encode("utf-8"))
18
>>> sys.getsizeof(ord("a"))
14

I really don't see that UTF-32 is much of an advantage here. UTF-8
happens to be, because I used an ASCII character, but the integer
beats them all, even for larger numbers:

>>> sys.getsizeof(ord("\U0001d11e"))
16

And there's even less difference on my Linux box, but of course, you
never compare against Linux because Python 3.2 wide builds don't suit
your numbers.

For longer strings, there's an even more efficient way to store them:
just store the memory address - that's going to be 4 bytes or 8,
depending on whether it's a 32-bit or 64-bit build of Python. There's
a name for this method of comparing strings: interning. Some languages
do it automatically for all strings; others (like Python) only when
you ask for it. Suddenly it doesn't matter at all what the storage
format is - if the two strings are equal, their addresses are the
same, and conversely. That's how to make it cheap.

> Hint: If you attempt to do the same exercise with
> words in a "latin" text, never forget the average length
> of a word is approximately 1000 chars.

I think you're confusing the length of a word with the value of a
picture.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
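[Editor's sketch of the two points above: the frequency count from the quoted post done the idiomatic way (no pre-encoding of keys needed), and interning making equal strings share one object. Exact getsizeof numbers vary by platform and Python build; only the relative ordering is assumed here.]

```python
import sys
from collections import Counter

# Counting character frequencies: Counter does the dict bookkeeping,
# and there is no need to encode the keys to "save memory".
freq = Counter("hello world")
assert freq["l"] == 3

# Interning: ask Python to reuse one object for equal strings, so an
# equality test can short-circuit on object identity (a pointer check).
# The strings are built at runtime so they start as distinct objects.
a = sys.intern("".join(["x"] * 100))
b = sys.intern("".join(["x"] * 100))
assert a is b  # same object, hence same address

# A small int is cheaper to store than a one-character str; the exact
# sizes differ by build (see the 26 vs 14 above), but the ordering holds.
assert sys.getsizeof(ord("a")) < sys.getsizeof("a")
```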