[R] clustering of binary data

marco milella Thu, 06 Dec 2012 11:51:32 -0800

Good morning,
I am analyzing a dataset composed of 364 subjects and 13 binary variables
(0,1 = absence,presence).
I am testing for possible associations (co-presence) among my variables. To do
this, I have been trying cluster analysis.


My main interest is to check for the significance of the obtained clusters.

First, I tried the pvclust() function, using method.hclust="ward"
and method.dist="binary". Altogether it works (clusters and significance
values are obtained). However, I'm not convinced by the distance matrix. The
associations between variables indeed differ from the results obtained in
PAST by using Ward on a Jaccard matrix (which should be OK for binary data).
Moreover, when I try to obtain a Jaccard matrix in R from my data, by using
the vegan package

mydistance<-vegdist(t(data),method="jaccard")

I receive the following error message:

Error in rowSums(x, na.rm = TRUE) : 'x' must be numeric
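
One possible cause of this message, assuming `data` is a data frame whose columns were read as character or factor rather than numeric, is that rowSums() is handed a non-numeric object. The sketch below uses a hypothetical toy data frame `toy` (not the real dataset) to show the coercion to a numeric matrix. It then computes a distance between variables with base R's dist(), whose "binary" method is documented as the proportion of mismatches among positions where at least one value is "on".

```r
# Hypothetical toy data standing in for the real dataset (values read as text)
toy <- data.frame(
  variable1 = c("0", "1", "0", "1"),
  variable2 = c("0", "1", NA, "1"),
  variable3 = c("1", "0", "1", "0"),
  row.names = paste0("case", 1:4),
  stringsAsFactors = FALSE
)

# Convert every column to numeric; as.character() first guards against factors
toy_num <- sapply(toy, function(col) as.numeric(as.character(col)))
rownames(toy_num) <- rownames(toy)

# Confirm the result is a numeric matrix before computing distances
is_numeric <- is.numeric(toy_num)

# Distance between variables (columns), hence the transpose
d_binary <- dist(t(toy_num), method = "binary")
d_matrix <- as.matrix(d_binary)

# With vegan installed, the original call could then be retried, e.g.:
# vegan::vegdist(t(toy_num), method = "jaccard", binary = TRUE, na.rm = TRUE)
```

Running str(data) on the real data frame would show whether any column is stored as character or factor, which is the assumption this sketch relies on.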


Below is a subset of my dataset:

       variable1 variable2 variable3 variable4 variable5 variable6 variable7
case1          0         0         0         0         0         1         0
case2          0         0         0         0         0         1         0
case3          0         0         0         0         0         1         0
case4          1         0         0         0         0         1         0
case5          0         0         0         0         0         1         0
case6          0         1         0         0         0         1         0
case7          0         1         0         0         0         1         0
case8          0         0         0         0         0         1         0
case9          0         0         0         0         0         1         0
case10         0         0         0         0         0         1         0
case11         1         0         0         1         0         1         1
case12         0         0         0         1         1         0         1

       variable8 variable9 variable10 variable11 variable12 variable13
case1          0         1          1          0          0          0
case2         NA        NA          1          0          0          0
case3          0         1          1          0          0          0
case4          1         0          1          0          0          0
case5          0         1          1          0          0          0
case6          1         0          1          0          0          0
case7          0         1          1          0          0          0
case8          1         0          1          0          0          0
case9          1         0          1          0          0          0
case10         0         1          1          0          0          0
case11         1         0          1          0          0          0
case12         1         0          1          0          0          0
.....

So, my questions are the following: Is the Jaccard index a good choice
for my kind of data? Is the binary distance used in pvclust theoretically
more correct? Is there any alternative to pvclust for testing the
significance of my clusters?
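
For reference, one quick check that bears on the second question is to compare a hand-computed Jaccard distance with base R's dist(..., method = "binary") on the same 0/1 vectors. The sketch below uses two hypothetical vectors `a` and `b` (not taken from the dataset). It assumes that pvclust hands method.dist="binary" on to dist(), which has not been verified here.

```r
# Two hypothetical presence/absence vectors (not taken from the real data)
a <- c(1, 1, 0, 0, 1, 0)
b <- c(1, 0, 0, 1, 1, 0)

# Jaccard distance by hand: 1 - |both present| / |at least one present|
both   <- sum(a == 1 & b == 1)
either <- sum(a == 1 | b == 1)
jaccard_manual <- 1 - both / either

# Base R "binary" distance on the same pair of vectors
jaccard_dist <- as.numeric(dist(rbind(a, b), method = "binary"))

# TRUE when the two computations agree up to numerical tolerance
same <- isTRUE(all.equal(jaccard_manual, jaccard_dist))
```

If `same` is TRUE for the real variables as well, any difference from the PAST results would more likely come from something other than the distance definition itself, for example the handling of NA values or the Ward variant used.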

Thanks in advance
Marco
