Re: [R] algorithm for clustering categorical data

Li, Yan Tue, 06 Aug 2013 11:43:54 -0700

Thanks, David. This is very useful!

-----Original Message-----
From: David Carlson [mailto:dcarl...@tamu.edu] 
Sent: Tuesday, August 06, 2013 11:27 AM
To: Li, Yan; r-help@r-project.org
Subject: RE: [R] algorithm for clustering categorical data


What do you mean by representing the categorical fields by 1:k?

a <- c("red", "green", "blue", "orange", "yellow")

becomes

a <- c(1, 2, 3, 4, 5)

That guarantees your results are worthless unless your categories have an 
inherent order (e.g. tiny, small, medium, big, giant). Otherwise the variable 
should be coded as k-1 indicator/dummy variables, four in this case, with 
yellow as the all-zero reference category (e.g.):

a.red <- c(1, 0, 0, 0, 0)
a.green <- c(0, 1, 0, 0, 0)
a.blue <- c(0, 0, 1, 0, 0)
a.orange <- c(0, 0, 0, 1, 0)

Then you can use Euclidean distance.
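
To make the difference concrete, here is a minimal sketch comparing Euclidean distances under the integer codes and under 0/1 indicators. One assumption that is not in the message above: the sketch uses all k = 5 indicator columns rather than k-1, so that every pair of distinct colours comes out equally far apart. With the k-1 coding, the reference category (yellow) sits at distance 1 from the others while the others are sqrt(2) apart.

```r
a <- c("red", "green", "blue", "orange", "yellow")

# Integer codes impose an order: red vs. yellow looks four times as far
# apart as red vs. green.
codes <- seq_along(a)
d_codes <- as.matrix(dist(codes))
dimnames(d_codes) <- list(a, a)

# One 0/1 indicator column per category (all k levels; an assumption here).
ind <- sapply(a, function(level) as.numeric(a == level))
rownames(ind) <- a
d_ind <- as.matrix(dist(ind))

d_codes["red", c("green", "yellow")]   # 1 and 4
d_ind["red", c("green", "yellow")]     # both sqrt(2)
```

Under the integer codes the distance depends entirely on the arbitrary order in which the colours were listed; under the indicators it does not.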

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352


-----Original Message-----
From: Li, Yan [mailto:yan...@ibi.com]
Sent: Tuesday, August 6, 2013 9:36 AM
To: dcarl...@tamu.edu; r-help@r-project.org
Subject: RE: [R] algorithm for clustering categorical data

Hi David and other R helpers,

Suppose I rescale the numerical fields to [0,1] and represent the categorical 
fields as 1:k, which is the same starting point as Gower's measure, but then 
use Euclidean distance instead of Gower's distance to do k-means clustering. 
How much difference does that make? What is the drawback?

Thank you,
Yan

-----Original Message-----
From: David Carlson [mailto:dcarl...@tamu.edu]
Sent: Thursday, August 01, 2013 12:08 PM
To: Li, Yan; r-help@r-project.org
Subject: RE: [R] algorithm for clustering categorical data

Read up on Gower's Distance measures (available in the ecodist
package) which can combine numeric and categorical data. You didn't give us any 
information about how you numerically transformed the categorical variables, 
but the usual approach is to create indicator variables that code 
presence/absence for each category within a categorical variable. Different 
variances between variables can be reduced by standardizing the variables.
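
As a rough illustration of Gower's distance on mixed data, the sketch below uses daisy() from the cluster package rather than the ecodist package named above; that substitution is an assumption made for the example, since daisy() accepts factor columns directly. The data frame and its column names are hypothetical.

```r
library(cluster)  # a recommended package, normally installed with R

# Hypothetical mixed data: one numeric and one categorical column.
df <- data.frame(
  income = c(12000, 15000, 48000, 52000, 51000, 13000),
  colour = factor(c("red", "red", "blue", "blue", "green", "green"))
)

# Gower dissimilarity: numeric columns are range-scaled, factors contribute
# 0 for a match and 1 for a mismatch, so every value lies in [0, 1].
d <- daisy(df, metric = "gower")

# Any method that accepts a dissimilarity object can follow; hierarchical
# clustering is used here only as an example.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
```

Note that kmeans() works on raw coordinates, not on a dissimilarity object, so a method such as hclust() or pam() from the cluster package is the more natural partner for Gower's distance.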

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of Li, Yan
Sent: Thursday, August 1, 2013 11:00 AM
To: r-help@r-project.org
Subject: [R] algorithm for clustering categorical data

Hi All,

Does anyone know of an algorithm for clustering categorical variables, and 
which R packages implement it? Which is the best?

If a data set has both numeric and categorical variables, what is the best 
clustering algorithm and R package to use?

I tried a numeric transformation of all categorical fields and did the 
clustering afterwards. But the transformed fields have values from 1...10, and 
my other fields are on a bigger scale:
10000-... This makes the categorical fields have less effect on the distance 
calculation...
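
The scale effect described above can be seen in a small sketch. The data and the helper rescale01() are hypothetical, made up for illustration. The sketch only addresses the difference in scale; it still treats the category code as a number, which is a separate problem.

```r
# Hypothetical data: a category code in 1..10 and a field at 10000 and above.
x <- data.frame(
  code   = c(1, 10, 1, 10),
  amount = c(10000, 10000, 50000, 50000)
)

# Unscaled: the amount column decides the distance almost alone.
d_raw <- as.matrix(dist(x))

# Rescale every column to [0, 1] so each contributes comparably.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
x01 <- as.data.frame(lapply(x, rescale01))
d_01 <- as.matrix(dist(x01))

d_raw[1, 2]  # rows differ only in code: 9
d_raw[1, 3]  # rows differ only in amount: 40000
d_01[1, 2]   # 1
d_01[1, 3]   # 1
```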

Thank you!
Yan

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
