Re: No trees in the stdlib?

João Valverde Sat, 27 Jun 2009 20:45:54 -0700

Miles Kaufmann wrote:

João Valverde wrote:
To answer the question of what I need the BSTs for, without getting into too many boring details it is to merge and sort IP blocklists, that is, large datasets of ranges in the form of (IP address, IP address, string). Originally I was also serializing them in a binary format (but no more after a redesign). I kept the "merge and sort" part as a helper script, but that is considerably simpler to implement.
...
As an anecdotal data point (honestly not trying to raise the "Python is slow" strawman), I implemented the same algorithm in C and Python, using pyavl. Round numbers were 4 mins vs 4 seconds, against Python (plus pyavl). Even considering I'm a worse Python programmer than C programmer, it's a lot. I know many will probably think I tried to do "C in Python" but that's not the case, at least I don't think so. Anyway like I said, not really relevant to this discussion.
What format were you using to represent the IP addresses? (Is it a Python class?) And why wouldn't you use a network address/subnet mask pair to represent block ranges? (It seems like being able to represent ranges that don't fit into a subnet's 2^n block wouldn't be that common of an occurrence, and that it might be more useful to make those ranges easier to manipulate.)

I was using a bytes subclass. I'm not free to choose CIDR notation; range boundaries must be arbitrary.
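For readers who want something concrete, here is a minimal sketch of merging and sorting ranges with arbitrary boundaries. It is not the script discussed in this thread. The function name `merge_ranges`, the sample data, the use of integers from the standard `ipaddress` module (which postdates this message) in place of a bytes subclass, and the decision to merge adjacent ranges and collect their labels are all assumptions made for illustration.

```python
import ipaddress


def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end, label) ranges.

    Hypothetical helper: addresses are converted to integers so that
    arbitrary (non-CIDR) boundaries compare and sort naturally.
    """
    as_ints = sorted(
        (int(ipaddress.ip_address(lo)), int(ipaddress.ip_address(hi)), label)
        for lo, hi, label in ranges
    )
    merged = []
    for lo, hi, label in as_ints:
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            prev_lo, prev_hi, prev_labels = merged[-1]
            merged[-1] = (prev_lo, max(prev_hi, hi), prev_labels | {label})
        else:
            merged.append((lo, hi, {label}))
    return [
        (str(ipaddress.ip_address(lo)), str(ipaddress.ip_address(hi)), sorted(labels))
        for lo, hi, labels in merged
    ]


# Made-up sample data in the (IP address, IP address, string) form.
blocks = [
    ("10.0.0.5", "10.0.0.20", "list-a"),
    ("10.0.0.15", "10.0.0.37", "list-b"),
    ("192.168.1.1", "192.168.1.1", "list-a"),
]
result = merge_ranges(blocks)
```

Sorting once and sweeping is enough when the whole dataset is available up front. A tree becomes interesting when ranges arrive incrementally and must stay ordered between insertions.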

One of the major disadvantages of using a tree container is that usually multiple comparisons must be done for every tree operation. When that comparison involves a call into Python bytecode (for custom cmp/lt methods) the cost can be substantial. Compare that to Python's hash-based containers, which only need to call comparison methods in the event of hash collisions (and that's hash collisions, not hash table bucket collisions, since the containers cache each object's hash value). I would imagine that tree-based containers would only be worth using with objects with comparison methods implemented in C.
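The difference in comparison counts can be observed with a small sketch. It assumes CPython and uses a binary search over a sorted list (`bisect`) as a stand-in for a balanced-tree lookup, since both need on the order of log2(n) comparisons. The `Key` class and its counters are hypothetical, introduced only for this illustration.

```python
import bisect


class Key:
    """Hypothetical key type that counts calls into its Python-level methods."""

    lt_calls = 0
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Key.lt_calls += 1
        return self.value < other.value

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)


keys = [Key(i) for i in range(1024)]   # already in sorted order
table = {k: None for k in keys}

# Reset the counters so only the two lookups below are measured.
Key.lt_calls = Key.eq_calls = 0

probe = Key(700)                        # equal to keys[700], but a distinct object
index = bisect.bisect_left(keys, probe)  # ordered lookup: roughly log2(1024) __lt__ calls
found_in_dict = probe in table           # hashed lookup: __eq__ only when full hashes match

print("ordered lookup __lt__ calls:", Key.lt_calls)
print("dict lookup __eq__ calls:", Key.eq_calls)
```

With distinct hash values, the dict lookup needs a single equality check to confirm the match, while the ordered lookup pays one Python-level call per step of the search.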

I would flip your statement and say one of the advantages of using trees is that they efficiently keep random input sorted. Obviously no algorithm can do that with single comparisons. And not requiring a hash function is a desirable quality for non-hashable objects. There's a world beyond dicts. :)

I profiled the code and indeed the comparisons dominated the execution time. Trimming the comparison function to the bare minimum, a single Python operation, almost doubled the program's speed.
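A rough way to see this effect is to time the same sort with and without a Python-level comparison method. This is only a sketch under assumptions: `SlowAddr` and `FastAddr` are hypothetical bytes subclasses, not the class from the original program, the data is random, and the timings will vary by machine and interpreter.

```python
import random
import timeit


class SlowAddr(bytes):
    """Hypothetical bytes subclass whose ordering runs as Python bytecode."""

    def __lt__(self, other):
        return bytes(self) < bytes(other)


class FastAddr(bytes):
    """Hypothetical bytes subclass that inherits the C-level comparison of bytes."""


rng = random.Random(0)  # fixed seed so runs use the same data
raw = [bytes(rng.randrange(256) for _ in range(4)) for _ in range(2000)]
slow = [SlowAddr(b) for b in raw]
fast = [FastAddr(b) for b in raw]

# Same data and same sort; only where the comparison executes differs.
slow_time = timeit.timeit(lambda: sorted(slow), number=20)
fast_time = timeit.timeit(lambda: sorted(fast), number=20)
print(f"Python-level __lt__: {slow_time:.3f}s")
print(f"inherited C comparison: {fast_time:.3f}s")
```

Both variants produce the same ordering, so any gap between the two timings is attributable to the cost of calling into Python for each comparison.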

Not that I'm trying to be an apologist, or reject your arguments; I can definitely see the use case for a well-implemented, fast tree-based container for Python. And so much the better if, when you need one, there was a clear consensus about what package to use (like PIL for image manipulation--it won't meet every need, and there are others out there, but it's usually the first recommended), rather than having to search out and evaluate a dozen different ones.

Thanks, and I'm not trying to start a religion either. ;)

--
http://mail.python.org/mailman/listinfo/python-list
