[issue9520] Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10, 000, 000+ keys.)

Dmitry Chichkov Wed, 04 Aug 2010 17:59:31 -0700

New submission from Dmitry Chichkov <dchich...@gmail.com>:

On large data sets (10-100 million keys) the default python dictionary 
implementation fails to meet memory and performance constraints. It also 
apparently fails to keep O(1) complexity (after just 1M keys). As such, there 
is a need for good, optimized, practical implementation of a high-performance 
container that can store such datasets.


One of the alternatives is a regular Patricia Trie data structure. It can meet 
performance requirements on such datasets. Criteria:
* strictly O(1);
* works well on large datasets (~100M keys); memory efficient;
* same or better performance as dict();
* supports regular dict | counter interface;
* supports unicode keys;
* supports any characters in the keys;
* persistent (can be pickled); 
* in-memory library;

There are a few existing implementations available:
* BioPython trie - http://biopython.org/
* py-radix - http://www.mindrot.org/projects/py-radix/
* PyPi trie - http://pypi.python.org/pypi/trie

A few other relevant alternatives/implementations:
* NIST trie - http://www.itl.nist.gov/div897/sqg/dads/HTML/patriciatree.html
* C++ Trie Library - http://wikipedia-clustering.speedblue.org/trie.php
* PyAvl trie - http://pypi.python.org/pypi/pyavl/1.12_1
* PyJudy tree - http://www.dalkescientific.com/Python/PyJudy.html
* Redis - http://code.google.com/p/redis/
* Memcached - http://memcached.org/

An alternative to a basic Patricia Trie could be some state-of-the-art approach 
from some modern research (i.e. 
http://www.aclweb.org/anthology/W/W09/W09-1505.pdf ), 


The best existing implementation I've been able to find so far is one in the 
BioPython. Compared to defaultdict(int) on the task of counting words. Dataset 
123,981,712 words (6,504,484 unique), 1..21 characters long:
* bio.tree - 459 Mb/0.13 Hours, good O(1) behavior
* defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior

At 8,000,0000 keys python defaultdict(int) starts showing almost O(N) behavior 
and gets unusable with 10,000,000+ unique keys. 



A few usage/installatio notes on BioPython trie:
$ sudo apt-get install python-biopython

>>> from Bio import trie
>>> trieobj = trie.trie()
>>> trieobj["hello"] = 5
>>> trieobj["hello"] += 1
>>> print trieobj["hello"]
>>> print trieobj.keys()

More examples at:
http://python-biopython.sourcearchive.com/documentation/1.54/test__triefind_8py-source.html

----------
components: Library (Lib)
messages: 112937
nosy: dmtr
priority: normal
severity: normal
status: open
title: Add Patricia Trie high performance container (python's defaultdict(int) 
is unusable on datasets with 10,000,000+ keys.)
type: performance
versions: Python 3.1, Python 3.2, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9520>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue9520] Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10, 000, 000+ keys.)

Reply via email to