Thank you both for your help; it saved me a lot of time searching for the right technique. I have another question regarding clustering:
My data set occasionally contains only one cluster, so in those cases
no clustering is required. Example:

list <- c(767, 773, 766, 772, 778, 777, 777, 758, 764, 771, 773, 768, 770, 752, 762, 769, 770, 768, 763)

Here a method such as kmeans will split the data into two groups,
although in fact there is only one. I may be using the wrong clustering
technique here. Is there a method that looks more closely at the effect
size between the groups and can be used to decide whether clustering
should be done at all? This relates to my earlier question about the
statistical test. Is there a different metric for these clustering
techniques, or a clustering technique that uses some form of test, so
that I can detect such cases (e.g. cluster only if the differences
between the groups have a large effect size) and skip clustering
otherwise?

I have a feeling that what I am asking for is really a pre-processing
step. Any ideas where I could find a technique that detects such cases?

Ralf

On Wed, May 5, 2010 at 1:35 PM, Achim Zeileis <achim.zeil...@uibk.ac.at> wrote:
> On Wed, 5 May 2010, Ralf B wrote:
>
>> Hi R friends,
>>
>> I am posting this question even though I know that the nature of it is
>> closer to general stats than R. Please let me know if you are aware of
>> a list for general statistical questions:
>>
>> I am looking for a simple method to distinguish two groups of data in
>> a long vector of numbers:
>>
>> list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
>>
>> I would like to 'learn' that 400,340 are different numbers by using a
>> simple approach.
>
> It seems that you want to cluster the data. There are, of course, loads of
> clustering algorithms around, see e.g.,
>   http://CRAN.R-project.org/view=Cluster
>
> In this simple example a standard hierarchical clustering approach shows you
> what you're after.
>
> ## data
> list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
>
> ## cluster using Ward method for Euclidean distances
> hc <- hclust(dist(list, method = "euclidean"), method = "ward")
> plot(hc)
> hc
>
> ## cut into two clusters
> split(list, cutree(hc, k = 2))
>
> hth,
> Z
>
>> The outcome of processing 'list' should therefore be:
>>
>> listA <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,3,2,4,5,6,4,3,6,4,5,3)
>> listB <- c(400,340)
>>
>> I am thinking of a non-parametric test since I have no knowledge of the
>> underlying distribution. The numbers are time differences between two
>> actions recorded from the same person over time. Because the data
>> was obtained from the same person I would naturally tend to use
>> Wilcoxon Signed-Rank test. Any thoughts on that?
>>
>> Are there any R packages that would process such a vector and use
>> non-parametric methods to split or divide groups based on their
>> values? Could clustering be the answer given that I already know that
>> I always have two groups with a significant difference between the
>> two?
>>
>> Thanks a lot,
>> Ralf
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>