dmtr wrote:

> On Aug 6, 11:50 pm, Peter Otten <__pete...@web.de> wrote:
>> I don't know to what extent it still applies, but switching off cyclic
>> garbage collection with
>>
>> import gc
>> gc.disable()
>
> Haven't tried it on the real dataset. On the synthetic test it (and
> sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
> usage. Not significant. I'll try it on the real dataset though.
>
>> while building large data structures used to speed up things
>> significantly. That's what I would try first with your real data.
>>
>> Encoding your unicode strings as UTF-8 could save some memory.
>
> Yes... In fact that's what I'm trying now... .encode('utf-8')
> definitely creates some clutter in the code, but I guess I can
> subclass dict... And it does save memory! A lot of it. Seems to be a
> bit faster too...
>
>> When your integers fit into two bytes, say, you can use an
>> array.array() instead of the tuple.
>
> Excellent idea. Thanks! And it seems to work too, at least for the
> test code. Here are some benchmarks (x86 desktop):
>
> Unicode key / tuple:
>
>>>> for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
>
> 1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
> 4.079240 seconds, 245143.698209 keys per second
>
> UTF-8 key / array.array:
>
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
>
> 1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
> 4.985136 seconds, 200596.331486 keys per second
>
> UTF-8 key / tuple:
>
>>>> for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
>
> 1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
> 3.572301 seconds, 279931.625282 keys per second
>
> Almost halved the memory usage. And faster too. Nice.
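For reference, the subclassing idea dmtr mentions above could look like
the minimal sketch below. The class name Utf8Dict and the details are
invented here, not taken from the thread; the point is just moving the
.encode('utf-8') clutter out of the calling code:

    class Utf8Dict(dict):
        # Transparently store unicode keys as UTF-8 byte strings;
        # plain str keys pass through untouched.
        def __setitem__(self, key, value):
            if isinstance(key, unicode):
                key = key.encode('utf-8')
            dict.__setitem__(self, key, value)

        def __getitem__(self, key):
            if isinstance(key, unicode):
                key = key.encode('utf-8')
            return dict.__getitem__(self, key)

    d = Utf8Dict()
    for i in xrange(1000):
        d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
    assert d[u'42'] == (42, 43, 44, 45, 46, 47, 48)

Note that in the benchmarks above it is the UTF-8 key / plain tuple
combination that comes out smallest and fastest, so the key encoding
alone appears to capture most of the win.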
> def benchmark_dict(d, N):
>     start = time.time()
>
>     for i in xrange(N):
>         length = lengths[random.randint(0, 255)]
>         word = ''.join([ letters[random.randint(0, 255)] for i in xrange(length) ])
>         d[word] += 1
>
>     dt = time.time() - start
>     vm = re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
>     print "%d keys (%d unique), %s, %f seconds, %f keys per second" % (N, len(d), vm, dt, N / dt)

Looking at your benchmark, random.choice(letters) probably has less
overhead than letters[random.randint(...)]. You might even try to
inline it as

    letters[int(random.random() * 256)]

Peter
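A quick way to compare the three variants Peter mentions is timeit. The
256-entry letters table below is a stand-in (the real one isn't shown
in the thread); only the three access patterns matter:

    import random
    import timeit

    letters = [chr(c) for c in xrange(256)]   # stand-in for the real table

    setup = "from __main__ import letters, random"
    for expr in ("letters[random.randint(0, 255)]",
                 "random.choice(letters)",
                 "letters[int(random.random() * 256)]"):
        # timeit's default is one million executions of expr
        print expr, timeit.timeit(expr, setup)

In CPython 2, random.choice(seq) is itself implemented roughly as
seq[int(random() * len(seq))], so inlining mainly removes the method
call, while randint goes through randrange's argument checking and
tends to be the slowest of the three.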