On Apr 7, 12:40 am, Peter Otten <__pete...@web.de> wrote:
> Peter Otten wrote:
> > MooMaster wrote:
> >
> >> Now we can't calculate a meaningful Euclidean distance for something
> >> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> >> distance or something overly complicated, so instead we'll use a
> >> simple quantization scheme of enumerating the set of values within the
> >> column domain and replacing the strings with numbers (i.e. Iris-setosa
> >> = 1, iris-versicolor=2).
> >
> > I'd calculate the distance as
> >
> > def string_dist(x, y, weight=1):
> >     return weight * (x == y)
>
> oops, this must of course be (x != y).
>
> > You don't get a high resolution in that dimension, but you don't introduce
> > an element of randomness, either.
> >
> > Peter
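For reference, here is a minimal sketch of the two schemes from the quoted thread side by side: the enumeration and the corrected `string_dist` (with `!=`, per the follow-up). The `mixed_dist` helper, the third label in `codes`, and the sample rows are assumptions added purely for illustration; they are not from the original posts.

```python
import math


def string_dist(x, y, weight=1):
    """Corrected version from the quote: 0 if the labels match, weight otherwise."""
    return weight * (x != y)


def mixed_dist(a, b, weight=1):
    """Hypothetical helper: Euclidean distance over two rows, where any
    string-valued column contributes string_dist instead of a difference."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            d = string_dist(x, y, weight)
        else:
            d = x - y
        total += d * d
    return math.sqrt(total)


# Enumeration scheme from the quoted post, extended with a third label
# (an assumption for illustration).
codes = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}

# Under enumeration, the gap to Iris-virginica depends on the arbitrary codes.
enum_gap_setosa = abs(codes["Iris-setosa"] - codes["Iris-virginica"])
enum_gap_versicolor = abs(codes["Iris-versicolor"] - codes["Iris-virginica"])

# Under string_dist, every pair of different labels is equally far apart.
flat_gap_setosa = string_dist("Iris-setosa", "Iris-virginica")
flat_gap_versicolor = string_dist("Iris-versicolor", "Iris-virginica")

# Made-up rows: two numeric columns followed by the label.
row_a = (5.0, 3.0, "Iris-setosa")
row_b = (5.0, 3.0, "Iris-virginica")
label_only_dist = mixed_dist(row_a, row_b)
```

The `weight` argument sets how much a label mismatch counts relative to the numeric columns. Choosing it is a judgment call that depends on the scale of the other features.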
The randomness doesn't matter too much: all K-means cares about is the distance between two points in a coordinate space, and as long as that space stays fixed between iterations, the particular numbers chosen don't matter (i.e. we don't want (1,1) becoming (3,1) on the next iteration, or the value for a quantized column changing). With that in mind, I was hoping to be lazy and just go with an enumeration approach...

Nevertheless, enumeration does introduce a subtle ordering for nominal data: if Iris-setosa = 1, Iris-versicolor = 2, and Iris-virginica = 3, then on that scale Iris-versicolor is "closer" to Iris-virginica than Iris-setosa is, when in fact such distances don't mean anything on a nominal scale. I hadn't thought about a function like that, but it makes a lot of sense. Thanks!

-- 
http://mail.python.org/mailman/listinfo/python-list