Good morning, I am analyzing a dataset consisting of 364 subjects and 13 binary variables (0 = absence, 1 = presence). I am testing for possible associations (co-presence) among my variables. To do this, I have been using cluster analysis.
My main interest is to assess the significance of the resulting clusters. First, I tried the pvclust() function with method.hclust="ward" and method.dist="binary". Overall it works (I obtain the clusters and their significance). However, I am not convinced by the distance matrix: the associations between variables differ from the results I obtained in PAST using Ward's method on a Jaccard matrix (which should be appropriate for binary data).

Moreover, when I try to compute a Jaccard matrix in R from my data using the vegan package,

    mydistance <- vegdist(t(data), method="jaccard")

I receive the following error message:

    Error in rowSums(x, na.rm = TRUE) : 'x' must be numeric

Below is a subset of my dataset:

        variable1  variable2  variable3  variable4  variable5  variable6  variable7  variable8  variable9 variable10 variable11 variable12 variable13
case1           0          0          0          0          0          1          0          0          1          1          0          0          0
case2           0          0          0          0          0          1          0         NA         NA          1          0          0          0
case3           0          0          0          0          0          1          0          0          1          1          0          0          0
case4           1          0          0          0          0          1          0          1          0          1          0          0          0
case5           0          0          0          0          0          1          0          0          1          1          0          0          0
case6           0          1          0          0          0          1          0          1          0          1          0          0          0
case7           0          1          0          0          0          1          0          0          1          1          0          0          0
case8           0          0          0          0          0          1          0          1          0          1          0          0          0
case9           0          0          0          0          0          1          0          1          0          1          0          0          0
case10          0          0          0          0          0          1          0          0          1          1          0          0          0
case11          1          0          0          1          0          1          1          1          0          1          0          0          0
case12          0          0          0          1          1          0          1          1          0          1          0          0          0
.....

So, my questions are the following:

Is the Jaccard index a good strategy for my kind of data?
Is the binary distance used in pvclust theoretically more correct?
Is there any alternative to pvclust for testing the significance of my clusters?

Thanks in advance,
Marco

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
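A minimal sketch, on made-up toy data, of two checks that may be relevant to the vegdist() error and the distance comparison described above. It rests on two assumptions that are not confirmed for the real data: that the "'x' must be numeric" error comes from a non-numeric column in data (for example a case ID column), and that the variables are coded strictly as 0/1/NA. The names toy, id, v1 to v3 and jaccard_dist are hypothetical. The second part compares base R's dist(method = "binary") with a Jaccard dissimilarity computed by hand, since the R documentation describes the "binary" distance as the proportion of positions where only one vector is "on" among those where at least one is "on".

```r
# Toy data (hypothetical): 6 cases, an ID column and 3 binary variables.
toy <- data.frame(
  id = paste0("case", 1:6),
  v1 = c(1, 1, 0, 0, 1, 0),
  v2 = c(1, 0, 0, 1, 1, NA),
  v3 = c(0, 1, 1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# With a character column present, t() yields a character matrix,
# and rowSums() fails on it. The message is captured instead of stopping.
err <- tryCatch(rowSums(t(toy), na.rm = TRUE),
                error = function(e) conditionMessage(e))
print(err)

# Assumed fix: move the ID into the row names and keep only the 0/1 columns.
x <- as.matrix(toy[, -1])
rownames(x) <- toy$id
storage.mode(x) <- "double"
stopifnot(is.numeric(x))

# Jaccard dissimilarity between two binary vectors, by hand:
# (positions where exactly one is 1) / (positions where at least one is 1),
# using only positions observed in both vectors.
jaccard_dist <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  a <- a[ok] == 1
  b <- b[ok] == 1
  sum(xor(a, b)) / sum(a | b)
}

# Variables are columns, so transpose to get distances between variables.
d_binary <- dist(t(x), method = "binary")
print(d_binary)
print(jaccard_dist(x[, "v1"], x[, "v3"]))

# Not run here: with vegan installed, the same numeric matrix could be passed as
# vegdist(t(x), method = "jaccard", binary = TRUE, na.rm = TRUE)
```

If the two printed values for the pair v1/v3 agree on the real data as well, differences from the PAST result would more likely come from the handling of missing values or from the Ward variant than from the distance itself, but that is only a guess.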