Thanks David,

This is very useful!

-----Original Message-----
From: David Carlson [mailto:dcarl...@tamu.edu]
Sent: Tuesday, August 06, 2013 11:27 AM
To: Li, Yan; r-help@r-project.org
Subject: RE: [R] algorithm for clustering categorical data

What do you mean by representing the categorical fields by 1:k?

a <- c("red", "green", "blue", "orange", "yellow")

becomes

a <- c(1, 2, 3, 4, 5)

That guarantees your results are worthless unless your categories have an inherent order (e.g. tiny, small, medium, big, giant). Otherwise it should be four (k-1) indicator/dummy variables, e.g.:

a.red <- c(1, 0, 0, 0, 0)
a.green <- c(0, 1, 0, 0, 0)
a.blue <- c(0, 0, 1, 0, 0)
a.orange <- c(0, 0, 0, 1, 0)

Then you can use Euclidean distance.

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: Li, Yan [mailto:yan...@ibi.com]
Sent: Tuesday, August 6, 2013 9:36 AM
To: dcarl...@tamu.edu; r-help@r-project.org
Subject: RE: [R] algorithm for clustering categorical data

Hi David and other R helpers,

Suppose I rescale the numerical fields to [0,1] and represent the categorical fields as 1:k, which is the same starting point as Gower's measure, but then use Euclidean distance instead of Gower's distance to do k-means clustering. How much difference does that make? What is the drawback?

Thank you,
Yan

-----Original Message-----
From: David Carlson [mailto:dcarl...@tamu.edu]
Sent: Thursday, August 01, 2013 12:08 PM
To: Li, Yan; r-help@r-project.org
Subject: RE: [R] algorithm for clustering categorical data

Read up on Gower's distance measure (available in the ecodist package), which can combine numeric and categorical data. You didn't give us any information about how you numerically transformed the categorical variables, but the usual approach is to create indicator variables that code presence/absence for each category within a categorical variable. Different variances between variables can be reduced by standardizing the variables.

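A minimal sketch of the Gower approach on mixed data, not taken from the thread. It assumes daisy() from the cluster package (a recommended package that ships with R) in place of ecodist, because daisy() accepts factors directly. The data frame df, its column names, and the choice of two clusters are made up for illustration.

```r
library(cluster)  # assumption: cluster::daisy is used here instead of ecodist

# Hypothetical mixed data: one large-scale numeric field, one unordered factor
df <- data.frame(
  income = c(12000, 15000, 14000, 52000, 50000, 51000),
  colour = factor(c("red", "red", "green", "blue", "blue", "blue"))
)

# Gower dissimilarity: numeric columns are scaled by their range,
# factor columns contribute 0 (same category) or 1 (different category)
d <- daisy(df, metric = "gower")

# k-means needs coordinates, so cluster the dissimilarities
# with PAM (k-medoids) instead
fit <- pam(d, k = 2)
fit$clustering
```

Because every variable contributes a value between 0 and 1, the numeric field cannot swamp the categorical one, and no order is imposed on the categories.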
-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Li, Yan
Sent: Thursday, August 1, 2013 11:00 AM
To: r-help@r-project.org
Subject: [R] algorithm for clustering categorical data

Hi All,

Does anyone know of an algorithm for clustering categorical variables? Which R packages implement it? Which is the best?

If a data set has both numeric and categorical variables, what is the best clustering algorithm and R package to use?

I tried a numeric transformation of all categorical fields and clustered afterwards. But the transformed fields have values from 1...10, and my other fields are on a bigger scale: 10000-... This makes the categorical fields have less effect on the distance calculation...

Thank you!
Yan

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
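
The scale problem in the original question, and the indicator coding suggested in the replies, can be sketched in base R roughly as follows. The data frame df, its column names, and the choice of two centers are hypothetical. model.matrix() is used here as one common way to build the k-1 indicator columns; it is not necessarily what either poster used.

```r
# Hypothetical data: a numeric field on a large scale and an unordered category
df <- data.frame(
  income = c(12000, 15000, 14000, 52000, 50000, 51000),
  colour = factor(c("red", "green", "blue", "orange", "yellow", "red"))
)

# Integer codes impose an order the colours do not have:
# codes 1 and 5 end up four times as far apart as codes 1 and 2
codes <- as.integer(df$colour)

# Rescale the numeric field to [0, 1] so it does not dominate the distance
income01 <- (df$income - min(df$income)) / (max(df$income) - min(df$income))

# k-1 indicator (dummy) columns; drop the intercept column that
# model.matrix() adds. The first factor level is the all-zero baseline.
dummies <- model.matrix(~ colour, data = df)[, -1, drop = FALSE]

x <- cbind(income = income01, dummies)

# Euclidean k-means on the rescaled, indicator-coded matrix
set.seed(1)
fit <- kmeans(x, centers = 2, nstart = 10)
fit$cluster
```

One caveat with k-1 columns: the baseline category sits at Euclidean distance 1 from every other category, while any two non-baseline categories are sqrt(2) apart. Keeping all k indicator columns makes every pair of categories equidistant, which may be preferable when the result feeds a distance-based method.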