I guess with the actual dataset I'll be able to improve memory usage a bit with BioPython's Bio.trie. That would probably be enough optimization to keep working with some comfort. On this test code Bio.trie does give an improvement in memory, though not much...
>>> d = dict()
>>> for i in xrange(0, 1000000):
...     d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
...
1000000 keys, ['VmPeak:\t 125656 kB', 'VmSize:\t 125656 kB'], 3.525858 seconds, 283618.896034 keys per second

>>> from Bio import trie
>>> d = trie.trie()
>>> for i in xrange(0, 1000000):
...     d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
...
1000000 keys, ['VmPeak:\t 108932 kB', 'VmSize:\t 108932 kB'], 4.142797 seconds, 241382.814950 keys per second
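For reference, here is a minimal sketch of the kind of harness that would produce report lines in that shape. The helper names (vm_stats, fill_and_report) and the exact format string are illustrative assumptions, not the original test code; it assumes Linux, since it reads VmPeak/VmSize from /proc/self/status, and uses the same Python 2 idioms as the transcript above.

import array
import time

def vm_stats():
    # Pull the VmPeak/VmSize lines out of /proc/self/status (Linux-only).
    # rstrip() drops the trailing newline but keeps the internal tab.
    return [line.rstrip() for line in open('/proc/self/status')
            if line.startswith(('VmPeak', 'VmSize'))]

def fill_and_report(d, n=1000000):
    # Insert n keys into the given mapping, then report memory,
    # elapsed time, and insertion throughput.
    start = time.time()
    for i in xrange(0, n):
        d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
    elapsed = time.time() - start
    print '%d keys, %s, %f seconds, %f keys per second' % (
        n, vm_stats(), elapsed, n / elapsed)

With something like this, fill_and_report(dict()) and fill_and_report(trie.trie()) give directly comparable numbers, since the loop body is identical. One caveat: VmPeak is cumulative over the life of the process, so each variant should really be run in a fresh interpreter.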