Jochen Schulz: > This solution may be more than you actually need, but I implemented two > metric space indexes which would do what you want (and I wanted to plug > it anyway :)):
Please plug such good things. It seems the Python community is an endless source of interesting modules I didn't know about. Your (single) module looks very nice. I'll take a better look later. Few notes: Use xrange instead of range, for Psyco they are often the same, but if you don't have Psyco the range may be slower and usually requires more memory. Java is faster than Python, but if you have some care and you know how things are implemented in Python, you can often write ""fast"" Python code. In the levenshtein function you have this: for i in range(1, m+1): prev, cur = cur, [i] + [0]*n You can speed up that levenshtein, using nearly constant (heap) memory, avoiding that re-creation of prev and cur (see below for D code you can back-translate to Python). You can translate this module to D (I may even do it myself, or I may help you do it), this has several advantages: - If you are careful it may become (quite) faster than your Java version. - The resulting code may be probably as long as the Python one, or not too much longer (but it may have or not have some extra limitations. D is flexible but it's a statically typed language, unlike Python); - Coding in D may be similar enough to Java coding for you, quite differently from coding it in C; - You can then use Pyd (http://pyd.dsource.org ) to create in a very simple way a single compiled module with the functions/classes that can be called by python (no intermediate python needed). Using Pyd is quite simpler than using C + Swig or writing a C module for Python. Later you can release the D code plus the already compiled module for Win too. - Disadvantages: you don't know D, you don't know how to use Pyd, and most people out there may prefer to maintain/modify C code, even if it's 3/4 times longer than the D code. - The following is to show you an example of D code (that uses my D libs for few bits, like that range(), the min(), but it's not too much difficult to avoid using them) that you may compare to the Python one. Note that this code is generic (it's a function template), and it takes as input any array, that means Unicode Strings too (utf8, 16 bit too, 32 bit too), it runs about 105-125 times faster than a quite similar Python version (Psyco is able to speed up this Python code about 20-25 times, so this is 4-5 times faster than Psyco): int editDistance(T)(T[] s1, T[] s2) { if (len(s1) > len(s2)) { auto sa = s1; s1 = s2; s2 = sa; } auto r1 = range(len(s2) + 1); auto r2 = new int[len(r1)]; foreach (i, c1; s1) { r2[0] = i + 1; foreach (j, c2; s2) r2[j+1] = c1 == c2 ? r1[j] : min(r2[j], r1[j], r1[j+1]) + 1; auto ra = r1; r1 = r2; r2 = ra; // swap lines } return r1[$ - 1]; } If you have questions feel free to ask :-) Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list