>barnesc at engr.orst.edu wrote:
> > So my question is: are there any other *practical* applications of a
> > B-tree based list/set/dict ?  In other words, is this module totally
> > worth coding, or is it just academic wankery and theoretical flim
> > flam ? :)
>
>B-trees are specifically designed for disk storage.  Seeking to a
>page takes much longer than reading a page of several kilobytes.
>B-trees read the keys for a many-way comparison in one seek.
>
>For in-memory comparison-based dictionaries, Red-Black trees,
>AVL trees, 2-3 trees, or skip-lists are a better choice.
>
>--
>--Bryan
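Bryan's point about seeks can be made concrete with a rough back-of-envelope (my numbers and assumed branch factor, not his): a lookup touches one node per tree level, and on disk each node visit costs roughly one seek.

```python
# Rough back-of-envelope (my assumptions, not Bryan's figures):
# a lookup descends one node per level, and on disk each node
# visit costs roughly one seek.
import math

N = 1_000_000     # keys stored
BRANCH = 1024     # assumed B-tree branch factor

btree_seeks = math.ceil(math.log(N, BRANCH))   # -> 2 seeks
bintree_seeks = math.ceil(math.log2(N))        # -> 20 seeks

print(btree_seeks, bintree_seeks)
```

So for a million keys a wide B-tree node pays for itself roughly tenfold in seek count, which is exactly why the many-way comparison per page is a win on disk but not necessarily in RAM.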
---------------------------------------------------------------------
Memory Usage
---------------------------------------------------------------------

Here overhead is compared to a C array of > 1 million PyObject *s.
The B-tree is assumed to have a branch factor of ~1024.

A B-tree has an average of 1.5x overhead (2x at worst, 1x at best,
using malloc).  This assumes that B-tree nodes contain a fixed-size C
array for storing values (which will be half full on average), and
that the child array is only allocated for internal nodes, as needed.

Any balanced binary tree with left and right child pointers has a 6x
overhead (using malloc, measured via a gcc win32 test program), or,
with a free list, a 3x overhead (though memory will not be returned
until the data structure's destructor is called).  I haven't tried
PyObject_Malloc(), but it should fall between these two factors.

A skip list has n values and n next pointers on the bottom level.  In
the ideal case (with respect to memory usage), we make the promotion
probability small enough that the higher levels take up a negligible
amount of memory.  Thus we have a 4x overhead (using malloc, measured
via a gcc win32 test program), or, with a free list, an overhead
slightly greater than 2x (again, memory is only returned when the
destructor is called).  I didn't test PyObject_Malloc(), but it
should give an overhead between 2x and 4x.

Thus, on average, a > 1 million element B-tree uses 25% less memory
than any other balanced data structure I am aware of, and 50% more
memory than a raw C array.

---------------------------------------------------------------------
Speed
---------------------------------------------------------------------

I have no idea how B-trees compare to skip lists (the likely
contender) in terms of speed.  To do this, one would need two
high-performance C implementations.
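The overhead ratios above can be reproduced with a small arithmetic sketch. This is my reconstruction, not the original test program: I assume 4-byte pointers (a 32-bit win32 build) and a malloc that adds ~8 bytes of per-allocation bookkeeping and rounds to 8-byte granules, which is a plausible model for the CRT heap the post mentions.

```python
# Back-of-envelope reconstruction of the overhead figures above.
# Assumptions (mine, not the poster's measurements):
#   - 32-bit build: a PyObject * is 4 bytes
#   - malloc adds ~8 bytes of bookkeeping per allocation and
#     rounds requests up to 8-byte granules (win32 CRT-like)

PTR = 4       # bytes per PyObject *
HEADER = 8    # assumed per-allocation bookkeeping
ALIGN = 8     # assumed allocation granularity

def malloc_size(n):
    """Actual bytes consumed by a malloc(n) under the model above."""
    return ((n + HEADER + ALIGN - 1) // ALIGN) * ALIGN

# Balanced binary tree: one node (value + left + right) per element.
bintree_node = 3 * PTR
bintree_malloc = malloc_size(bintree_node) / PTR    # -> 6.0x
bintree_freelist = bintree_node / PTR               # -> 3.0x

# Skip list, bottom level only: value + next pointer per element.
skip_node = 2 * PTR
skip_malloc = malloc_size(skip_node) / PTR          # -> 4.0x
skip_freelist = skip_node / PTR                     # -> 2.0x

# B-tree leaf: fixed array of K slots.  Half full -> 2x, full -> 1x,
# so ~1.5x on average; node headers are negligible for K ~ 1024.
K = 1024
btree_half_full = (K * PTR) / ((K // 2) * PTR)      # -> 2.0x
btree_avg = (btree_half_full + 1.0) / 2             # -> 1.5x

print(bintree_malloc, bintree_freelist,
      skip_malloc, skip_freelist, btree_avg)
```

Under these assumptions the model lands exactly on the quoted 6x/3x, 4x/2x, and 1.5x figures, which suggests the original numbers come from this kind of per-node accounting rather than anything exotic.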
From this, the Python performance can presumably be deduced (although
caching causes unpredictable effects, the faster C algorithm should
also be faster in Python).

You could start with optimized parallel C implementations, such as
the ones by John-Mark Gurney:

 * http://resnet.uoregon.edu/~gurney_j/jmpc/btree.html
 * http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html

He indicates that the skip list is slower than the B-tree, but
doesn't give any solid numbers.  Of course you'd have to check his
code and optimize it to death (perhaps representing a sorted set of
integers in each respective data structure would be a good test).

Also note that my B-tree code currently has optimized special cases
for block insertions, deletions, and retrievals, since this is easy
(er, well, relatively speaking) to do in the B-tree case.  In the
skip list case, block retrievals are easy (though not as fast as
B-tree block retrievals, due to pointer indirection and cache
problems).  I have no idea how to do a fast block insert or delete
on a skip list.

----------------------------------------------------------------------

Well, enough ivory tower silliness and academic blustering.

<on_topic>
Are there any *practical* applications for in-memory balanced data
structures (e.g. skip list, AVL tree, RB tree, B-tree) ?
</on_topic>

 - Connelly Barnes

--
http://mail.python.org/mailman/listinfo/python-list
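The block-retrieval point can be illustrated with a toy sketch (mine, not the poster's C code): model the B-tree leaf level as a list of contiguous chunks, and a block retrieval becomes one slice copy per leaf touched, instead of one pointer dereference per element as in a skip list.

```python
# Toy sketch (my illustration, not the poster's code) of why block
# retrieval is cheap when values live in contiguous per-node arrays.
# `leaves` models a B-tree leaf level: a list of sorted chunks.

def block_get(leaves, start, count):
    """Return `count` consecutive values beginning at logical
    index `start`, copying whole slices out of each leaf."""
    out = []
    i = start
    for leaf in leaves:
        if i >= len(leaf):       # skip whole leaves before the start
            i -= len(leaf)
            continue
        take = min(count - len(out), len(leaf) - i)
        out.extend(leaf[i:i + take])   # one slice copy per leaf touched
        i = 0
        if len(out) == count:
            break
    return out

leaves = [[1, 3, 5], [7, 9, 11, 13], [17, 19]]
print(block_get(leaves, 2, 4))   # -> [5, 7, 9, 11]
```

A real C implementation would do the same thing with memcpy over the leaf arrays; a skip list has no contiguous runs to copy, which is why its block retrievals are slower even though they are conceptually just as easy.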