>>>>> "DC" == David Carlson <dcarl...@tamu.edu> >>>>> on Tue, 6 Aug 2013 10:26:56 -0500 writes:
> What do you mean by representing the categorical fields by 1:k?

> a <- c("red", "green", "blue", "orange", "yellow")

> becomes

> a <- c(1, 2, 3, 4, 5)

> That guarantees your results are worthless

worthless indeed!

> unless your categories
> have an inherent order (e.g. tiny, small, medium, big, giant).
> Otherwise it should be four (k-1) indicator/dummy variables (e.g.):

> a.red <- c(1, 0, 0, 0, 0)
> a.green <- c(0, 1, 0, 0, 0)
> a.blue <- c(0, 0, 1, 0, 0)
> a.orange <- c(0, 0, 0, 1, 0)

> Then you can use Euclidean distance.

Yes, ... or use Gower's or other similarly sophisticated distances,
as you (David) mentioned earlier in this thread.

Do also note that a generalized Gower's distance (+ weighting of
variables) is available from the ('recommended', hence always
installed) package 'cluster':

  require("cluster")
  ?daisy
  ## notably  daisy(*, metric="gower")

Note that daisy() is more sophisticated than most users know: its
'type = *' specification notably allows asymmetric behavior for binary
variables (such as your a.<col> dummies above), which may be quite
important in "rare event" and similar cases.

Martin

> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352

> -----Original Message-----
> From: Li, Yan [mailto:yan...@ibi.com]
> Sent: Tuesday, August 6, 2013 9:36 AM
> To: dcarl...@tamu.edu; r-help@r-project.org
> Subject: RE: [R] algorithm for clustering categorical data

> Hi David and other R helpers,

> If I rescale the numerical fields to [0,1] and represent the
> categorical fields as 1:k, which is the same starting point as
> Gower's measure, but use Euclidean distance instead of Gower's
> distance to do k-means clustering, how much is the difference? What
> is the drawback?
> Thank you,
> Yan

> -----Original Message-----
> From: David Carlson [mailto:dcarl...@tamu.edu]
> Sent: Thursday, August 01, 2013 12:08 PM
> To: Li, Yan; r-help@r-project.org
> Subject: RE: [R] algorithm for clustering categorical data

> Read up on Gower's Distance measures (available in the ecodist
> package) which can combine numeric and categorical data. You didn't
> give us any information about how you numerically transformed the
> categorical variables, but the usual approach is to create indicator
> variables that code presence/absence for each category within a
> categorical variable. Different variances between variables can be
> reduced by standardizing the variables.

> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Li, Yan
> Sent: Thursday, August 1, 2013 11:00 AM
> To: r-help@r-project.org
> Subject: [R] algorithm for clustering categorical data

> Hi All,

> Does anyone know of an algorithm for clustering categorical
> variables? R packages? Which is the best?

> If a data set has both numeric and categorical data, what is the
> best clustering algorithm to use, and which R package?

> I tried a numeric transformation of all categorical fields and did
> the clustering afterwards. But the transformed fields have values
> from 1...10, and my other fields are on a bigger scale:
> 10000-... This will make the categorical fields have less effect on
> the distance calculation...

> Thank you!
> Yan

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
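
For readers following the thread, here is a minimal sketch of the suggestions above on a made-up data frame. The `income` and `colour` columns and all values are hypothetical and not taken from Yan's data. It contrasts 1:k coding with daisy(*, metric="gower"), and shows the 'type = list(asymm = ...)' specification for 0/1 dummies. One substitution to note: kmeans() works on raw coordinates and not on a dissimilarity object, so the sketch clusters with pam() from the same 'cluster' package, which is an assumption of this example and not something proposed in the thread.

```r
library(cluster)  # 'recommended' package, so it ships with R

# Hypothetical toy data: one large-scale numeric field, one unordered factor.
df <- data.frame(
  income = c(10000, 12000, 50000, 52000, 11000, 51000),
  colour = factor(c("red", "green", "blue", "red", "green", "blue"))
)

# Coding the factor as 1:k imposes an arbitrary order (alphabetical here),
# so Euclidean distance treats some colours as "closer" than others.
codes <- as.integer(df$colour)
d_codes <- dist(codes)

# Gower dissimilarity: numeric columns are range-scaled, factors use
# simple matching, so no manual rescaling or dummy coding is needed.
d <- daisy(df, metric = "gower")

# Same idea with explicit 0/1 dummies declared as asymmetric binary,
# so that shared absences (0-0) do not count as agreement.
df2 <- data.frame(
  income  = df$income,
  a.red   = as.integer(df$colour == "red"),
  a.green = as.integer(df$colour == "green")
)
d_asym <- daisy(df2, metric = "gower",
                type = list(asymm = c("a.red", "a.green")))

# Cluster on the dissimilarities with partitioning around medoids.
fit <- pam(d, k = 2)
fit$clustering
```

Because Gower's measure range-scales each numeric column, the large-scale field no longer swamps the categorical one, which was the concern in the original question. as.matrix(d) shows the pairwise values if you want to inspect them.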