Tim Peters <t...@python.org> added the comment:

Stefan, thanks - I think I understand what you're driving at now.

You're (re)discovering that sorting by lexicographic ordering rules is 
_inherently_ suboptimal if list elements can only be compared "starting at 
index 0" each time. Not just tuples, but also, e.g., lists, byte arrays, and 
even strings. Any sequence type whose ordering derives from lexicographic 
generalization of its elements' ordering.

This is quite independent of whether or not CPython tries to exploit type 
homogeneity.

The usual(*) solution is indeed to exploit a kind of bucket sort instead, first 
sorting at elements' index 0 alone, saying all those elements with the same 
value at index 0 constitute "a bucket", and then applying the same idea to each 
bucket but sorting on index 1 instead. Where those in turn are partitioned into 
buckets equal at index 1, and those buckets are similarly each sorted on index 
2. And so on, and so on.

No doubt that can be a big win, and even in the absence of the type homogeneity 
tricks. It's far beyond the scope of _this_ PR, though, and is really an 
entirely different approach.

I've thought about it at times, but it's "a real project" to do it right. In 
effect, it would implement an optimized generalization of the SO OP's 
"automagically convert multikey to multisort" wish.

Fine by me if someone pursues that, but I don't have the energy for it, nor 
much interest either in doing a half-hearted job of it.

I always expected that if it came up at all, it would be in the context of 
sorting lists of strings.  For example, consider:

    xs = ['x' * 10_000 + str(i) for i in range(1_000_000)]
    random.shuffle(xs)

Even with type homogeneity tricks, Python's sorting of the result is very far 
from optimal. Every single comparison starts by burning time skipping over the 
(at least) ten thousands equal characters at the start.

The "bucket sort" (or it could be viewed as a form of MSD radix sort) quickly 
discovers that all the characters at index 0 are equal, and also all those at 
index 1, ..., and all those at index 9999. The "real work" of sorting is then 
reduced to working on the (at most) 6 characters remaining.

(*) But I'm lying ;-) The actual usual solution is to use "multikey quicksort", 
because it's easy to code and works "in place". But it's not a stable sort. 
String sorting doesn't care about that, but sorting other kinds of sequences 
can care a lot.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45530>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to