Dear R-users,

I am trying to run kmeans on a set of 100 observations. However, R cannot recover the true underlying groups, although other software such as JMP and Minitab produces the desired result.
The following is a brief example of what I am doing.

library(stringdist)

# Terms with various typos
test=c('hematolgy','hemtology','oncology','onclogy',
       'oncolgy','dermatolgy','dermatoloy','dematology',
       'neurolog','nerology','neurolgy','nerology')

# Pairwise Levenshtein distances between all terms
dis=stringdistmatrix(test,test, method = "lv")

# k-means with 4 centers, applied to the rows of the distance matrix
set.seed(123)
cl=kmeans(dis,4)

# Collect the terms assigned to each cluster
grp_cl=vector('list',4)
for(i in 1:4)
{
  grp_cl[[i]]=test[which(cl$cluster==i)]
}
grp_cl

[[1]]
[1] "oncology" "onclogy"

[[2]]
[1] "neurolog" "nerology" "neurolgy" "nerology"

[[3]]
[1] "oncolgy"

[[4]]
[1] "hematolgy"  "hemtology"  "dermatolgy" "dermatoloy" "dematology"

In the above example, the 'test' variable consists of a set of terms with various typos, and I am trying to group similar words based on their string distance. Unfortunately, kmeans does not reproduce the following result, which the other software is able to produce.

[[1]]
[1] "oncology" "onclogy"  "oncolgy"

[[2]]
[1] "neurolog" "nerology" "neurolgy" "nerology"

[[3]]
[1] "dermatolgy" "dermatoloy" "dematology"

[[4]]
[1] "hematolgy" "hemtology"

Does anyone know if there is a way around this? I have heard from a lot of people that multivariate analysis in R does not produce the desired result most of the time.

Any help is really appreciated. Thanks in advance.

Cassie

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.