This really an algorithm question more that a Python question, but it would be implemented in Python....
I have a list of strings, A. I want to find a set of strings B such that for any "a in A" there exists "b in B" such that b is a sub-string of a. But I also want to minimise T = sum_j t_j where t_j = count of the number of elements in A which have b[j] as a sub-string My guess is that finding the smallest possible T satisfying the constraint would be hard. However, for my application just keeping it reasonably small would help. In my case the list A contains over two million addresses. The (top down) heuristic approach I am tempted to employ is to start by dividing the entries in A into sets of tokens, then take the union of all these sets as a starting point for B. Then I would try to trim B by 1. looking for elements that I could remove while still satisfying the constraint 2. replacing two elements by a common sub-string if that reduced T Anyway. It occurred to me that this might be a known problem. Any pointers gratefully received. - Andrew -- http://mail.python.org/mailman/listinfo/python-list