Steven, thank you for answering. See my comments inline. Perhaps I should have formulated my question a bit differently: are there any *compact* high-performance containers for unicode()/str() objects in Python? By *compact* I don't mean compressed, just optimized for memory usage rather than for raw speed.
What I'm really looking for is a dict() that maps short unicode strings to tuples of integers. But just having a *compact* list container for unicode strings would help a lot (because I could add a __dict__ and go from there).

> Yes, lots of ways. For example, do you *need* large lists? Often a better
> design is to use generators and iterators to lazily generate data when
> you need it, rather than creating a large list all at once.

Yes, I do need to be able to process large data sets. No, there is no way I can use an iterator or lazily generate the data when I need it.

> An optimization that sometimes may help is to intern strings, so that
> there's only a single copy of common strings rather than multiple copies
> of the same one.

Unfortunately the strings are all unique (think usernames on Facebook or Wikipedia), so interning doesn't help. And I can't afford to store them in a db/memcached/redis/etc... Too slow.

> Can you compress the data and use that? Without knowing what you are
> trying to do, and why, it's really difficult to advise a better way to do
> it (other than vague suggestions like "use generators instead of lists").

Yes, I've tried, but I was unable to find a good, unobtrusive way to do it. Every attempt either adds unnecessary, pesky code or is too slow. See more at: http://bugs.python.org/issue9520

> Very often, it is cheaper and faster to just put more memory in the
> machine than to try optimizing memory use. Memory is cheap, your time and
> effort is not.

Well... I'd really prefer to use, say, 16 bytes for 10-character strings and fit the data into 8 GB, rather than pay an extra $1k for 32 GB.

> > Well... 63 bytes per item for very short unicode strings... Is there
> > any way to do better than that? Perhaps some compact unicode objects?
>
> If you think that unicode objects are going to be *smaller* than byte
> strings, I think you're badly informed about the nature of unicode.

I don't think that unicode objects are going to be *smaller*! But AFAIK CPython stores unicode internally as UCS-2 or UCS-4 (2 or 4 bytes per character), doesn't it? Even so, 63 bytes per item for 10-character strings seems a bit excessive. My question was: is there any way to do better than that?

> Python is not a low-level language, and it trades off memory compactness
> for ease of use. Python strings are high-level rich objects, not merely a
> contiguous series of bytes. If all else fails, you might have to use
> something like the array module, or even implement your own data type in
> C.

So, once more: are there any *compact* high-performance containers (with a dict or list interface) in Python? Two rough sketches of the kind of thing I mean are in the P.S. below.

--
Regards,
Dmitry
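
P.S. For what it's worth, here is roughly how I measured the per-item cost (a minimal sketch; sys.getsizeof numbers vary by Python version, 32- vs 64-bit, and narrow vs wide unicode builds, so the 63 bytes above is just what I see on my machine):

    import sys

    s = u'0123456789'
    # Size of the unicode object itself (object header plus characters).
    print sys.getsizeof(s)

    # A realistic container: the list adds one pointer (8 bytes on
    # 64-bit) per item on top of each unicode object it references.
    usernames = [u'user%06d' % i for i in xrange(100000)]
    total = sys.getsizeof(usernames) + sum(sys.getsizeof(u) for u in usernames)
    print total / len(usernames)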
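
P.P.S. And here is the kind of workaround I keep ending up with. This is only a sketch, not tested production code, and the name CompactMap (and the fixed value_width) are just inventions for the example: pack all keys UTF-8-encoded into one contiguous buffer, keep offsets and values in array.array, and binary-search on lookup. Since UTF-8 byte order matches code point order, sorting once at build time lets lookups compare raw bytes without decoding anything.

    from array import array

    class CompactMap(object):
        """Sketch of a read-only mapping from short unicode keys to
        fixed-width tuples of ints. Keys live UTF-8 encoded in one
        string; per-key overhead is the encoded bytes plus one offset,
        not a whole unicode object plus a dict entry."""

        def __init__(self, items, value_width):
            # items: (unicode_key, tuple_of_ints) pairs, keys unique.
            items = sorted(items)      # unicode order == UTF-8 byte order
            self._width = value_width
            self._values = array('l')
            self._offsets = array('L', [0])
            blob = bytearray()
            for key, value in items:
                assert len(value) == value_width
                blob += key.encode('utf-8')
                self._offsets.append(len(blob))
                self._values.extend(value)
            self._blob = bytes(blob)   # immutable from here on
            self._count = len(items)

        def _key_at(self, i):
            # Encoded bytes of the i-th key, no unicode object created.
            return self._blob[self._offsets[i]:self._offsets[i + 1]]

        def __getitem__(self, key):
            target = key.encode('utf-8')
            lo, hi = 0, self._count    # plain binary search, O(log n)
            while lo < hi:
                mid = (lo + hi) // 2
                k = self._key_at(mid)
                if k < target:
                    lo = mid + 1
                elif k > target:
                    hi = mid
                else:
                    w = self._width
                    return tuple(self._values[mid * w:(mid + 1) * w])
            raise KeyError(key)

Used like this:

    m = CompactMap([(u'alice', (1, 2)), (u'bob', (3, 4))], 2)
    print m[u'bob']    # -> (3, 4)

Lookups are O(log n) instead of dict's O(1), and it's build-once, but a 10-character ASCII key costs roughly 10 bytes plus one offset instead of 63+. Is there something like this, done properly, in the stdlib or a C extension?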