In article <53299eac$0$29994$c3e8da3$54964...@news.astraweb.com>, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> If you have a million items, then the odds that those > million items happen to be sorted (the worst-case order) are 1 in a > million factorial. That's a rather small number, small enough that we can > discount it from ever happening, not in a million lifetimes of the > Universe. If the items are coming from some inherently random process, then I agree. But, not all data sources are random. Imagine you had a web site, and wanted to store various bits of data from the stream of input requests. You decided that the data structure you were going to use was a balanced tree. If your keys were, say, a hash of the client's IP address, then it's a pretty good guess they're random. But, what if the keys were the time the request was received? Those are obviously not random, and using them as a keys in a balanced tree would give you sub-optimum performance. This is not a hypothetical scenario. Songza uses MongoDB as our database. The indexes are balanced trees. One of our biggest collections has a multi-component key, one of the components being the request time truncated to the hour. Insertion time into that collection has an obvious sawtooth shape, with performance degrading as each hour progresses and jumping back up as the time rolls over to the next hour. This is due (we believe :-)) to the constant rebalancing of the index trees. Almost-sorted data sets are very common. For example, here's a pipeline I run a lot (from memory, could have gotten some detail wrong): grep pattern lofgile | awk '{print $7}' | sed 's/:[0-9][0-9]$//' | sort | uniq -c Field 7 is the timestamp for when a request came in. What I'm doing here is counting how many requests of a certain type came in during each minute of the day. Logging isn't exactly in chronological order, so I need to sort the times before I count them. But, it's in *almost* chronological order; a sort which had pathologically bad behavior on almost sorted data would be unusable here. -- https://mail.python.org/mailman/listinfo/python-list