On Aug 6, 11:50 pm, Peter Otten <__pete...@web.de> wrote:
> I don't know to what extent it still applies, but switching off cyclic
> garbage collection with
>
> import gc
> gc.disable()
>
> while building large data structures used to speed up things significantly.
> That's what I would try first with your real data.

Haven't tried it on the real dataset yet. On the synthetic test it (together
with sys.setcheckinterval(100000)) gave about a 2% speedup and no change in
memory usage, so nothing significant. I'll try it on the real dataset though.
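For the real run I have something like this in mind. It is just a sketch
wrapped around the synthetic test loop, not the actual loading code;
build_dict() stands in for whatever fills the dict:

import gc

def build_dict(n):
    # Same shape as the synthetic test; stands in for the real loading loop.
    d = {}
    for i in xrange(0, n):
        d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
    return d

gc.disable()        # no reference cycles while just filling a dict
try:
    d = build_dict(1000000)
finally:
    gc.enable()     # switch the collector back on afterwards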
> Encoding your unicode strings as UTF-8 could save some memory.

Yes, in fact that's what I'm trying now. .encode('utf-8') definitely creates
some clutter in the code, but I guess I can subclass dict... And it does save
memory! A lot of it. Seems to be a bit faster too.

> When your integers fit into two bytes, say, you can use an array.array()
> instead of the tuple.

Excellent idea, thanks! And it seems to work too, at least for the test code.
Here are some benchmarks (x86 desktop):

Unicode key / tuple:

>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'], 4.079240 seconds, 245143.698209 keys per second

UTF-8 key / array.array('i'):

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'], 4.985136 seconds, 200596.331486 keys per second

UTF-8 key / tuple:

>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'], 3.572301 seconds, 279931.625282 keys per second

Almost halved the memory usage. And faster too. Nice.

--
Dmitry
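P.S. The dict subclass I had in mind for hiding the .encode('utf-8') clutter
would look roughly like this. It is an untested sketch, Utf8Dict is just a
placeholder name, and only __setitem__, __getitem__ and __contains__ are
covered:

class Utf8Dict(dict):
    """dict that encodes unicode keys to UTF-8 byte strings on the way in."""

    def __setitem__(self, key, value):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        dict.__setitem__(self, key, value)

    def __getitem__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__getitem__(self, key)

    def __contains__(self, key):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return dict.__contains__(self, key)

# Same loop as the benchmarks, minus the explicit .encode() calls.
d = Utf8Dict()
for i in xrange(0, 1000000):
    d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)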