On Apr 7, 12:38 am, Peter Otten <__pete...@web.de> wrote:
> MooMaster wrote:
> > Now we can't calculate a meaningful Euclidean distance for something
> > like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> > distance or something overly complicated, so instead we'll use a
> > simple quantization scheme of enumerating the set of values within the
> > column domain and replacing the strings with numbers (i.e. Iris-setosa
> > = 1, Iris-versicolor = 2).
>
> I'd calculate the distance as
>
> def string_dist(x, y, weight=1):
>     return weight * (x != y)
>
> You don't get a high resolution in that dimension, but you don't introduce
> an element of randomness, either.
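(For context, here is a minimal sketch of how that per-column match/mismatch distance could be folded into an overall distance over mixed numeric and categorical columns. The record layout and the `mixed_dist` helper are made up for illustration, not anything from the thread.)

    import math

    def string_dist(x, y, weight=1):
        # 0 if the categorical values match, `weight` otherwise
        return weight * (x != y)

    def mixed_dist(a, b, weight=1):
        # a and b are (numeric_features, category) pairs, e.g.
        # ((5.1, 3.5, 1.4, 0.2), "Iris-setosa")
        (num_a, cat_a), (num_b, cat_b) = a, b
        numeric_part = sum((p - q) ** 2 for p, q in zip(num_a, num_b))
        return math.sqrt(numeric_part + string_dist(cat_a, cat_b, weight) ** 2)

    print(mixed_dist(((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
                     ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
                     weight=10))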
Does the algorithm require well-ordered data along the dimensions?
Though I've never heard of it, the fact that it's called "bisecting
Kmeans" suggests to me that it does, which means this wouldn't work.

However, the OP had better be sure to set the scales for the quantized
dimensions high enough that no clusters form containing points with
different discrete values. That, in turn, suggests he might as well not
even bother sending the discrete values to the clustering algorithm,
but instead call it once for each unique set of discretes. (However, I
could imagine the marginal cost of more dimensions is less than that of
multiple runs; I've been dealing with such a case at work.) I'll leave
it to the OP to decide.

Carl Banks
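(A rough sketch of the "one clustering run per unique set of discretes" idea, under the assumption that the data is a list of tuples with the categorical column last. `split_by_discretes` is a hypothetical helper, and the clustering call is left as a placeholder for whatever bisecting k-means routine the OP is using.)

    from collections import defaultdict

    def split_by_discretes(rows, discrete_idx):
        # Group rows by the values of their discrete columns, so each
        # group can be clustered independently on its numeric columns.
        groups = defaultdict(list)
        for row in rows:
            key = tuple(row[i] for i in discrete_idx)
            numeric = tuple(v for i, v in enumerate(row) if i not in discrete_idx)
            groups[key].append(numeric)
        return groups

    rows = [
        (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
        (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
        (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    ]

    for key, numeric_rows in split_by_discretes(rows, discrete_idx={4}).items():
        # Substitute the OP's bisecting k-means call here; each call
        # then sees only one combination of discrete values.
        print(key, numeric_rows)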