You really should read the instructions before complaining. The manual page for kmeans clearly states that it works on "a numeric matrix of data." That is not what you provided. You gave it a distance matrix. The function pam() will work with a distance matrix if it is properly labeled as such, but stringdistmatrix() does not label the output as a distance matrix:
# 'test' is the vector defined in your message below.
dis <- stringdistmatrix(test, test, method = "lv")
dis
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,]    0    2    6    6    5    2    3    2    6     5     4     5
 [2,]    2    0    4    5    5    4    4    2    4     3     4     3
 [3,]    6    4    0    1    1    7    7    5    5     3     5     3
 [4,]    6    5    1    0    2    7    8    6    6     4     5     4
 [5,]    5    5    1    2    0    6    7    6    6     4     4     4
 [6,]    2    4    7    7    6    0    1    2    7     5     5     5
 [7,]    3    4    7    8    7    1    0    2    6     5     6     5
 [8,]    2    2    5    6    6    2    2    0    5     4     5     4
 [9,]    6    4    5    6    6    7    6    5    0     2     2     2
[10,]    5    3    3    4    4    5    5    4    2     0     2     0
[11,]    4    4    5    5    4    5    6    5    2     2     0     2
[12,]    5    3    3    4    4    5    5    4    2     0     2     0

require(cluster) # Works once you have installed it.
cl <- pam(dis, 4, diss=TRUE) # Note you must tell pam() that this is a distance matrix.
print(paste(test, "-", cl$clustering))
 [1] "hematolgy - 1"  "hemtology - 1"  "oncology - 2"   "onclogy - 2"
 [5] "oncolgy - 2"    "dermatolgy - 3" "dermatoloy - 3" "dematology - 1"
 [9] "neurolog - 4"   "nerology - 4"   "neurolgy - 4"   "nerology - 4"

The only apparent error is "dematology", which is grouped with the hematology terms. But if you look at row 8 of the distance matrix above, you will see that its Levenshtein distance (the method you chose) is 2 to each of hematolgy, hemtology, dermatolgy, and dermatoloy, so it is exactly as close to the hematology terms as to the dermatology terms. You may want to choose a distance metric that places greater weight on the initial letter.

Peer-reviewed research publications, as opposed to idle gossip, confirm the accuracy of R.

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Peter Langfelder
Sent: Monday, April 28, 2014 11:44 PM
To: cassie jones
Cc: r-help@r-project.org
Subject: Re: [R] Fwd: problem with kmeans

You are using the wrong algorithm. You want Partitioning around Medoids (PAM, function pam), not k-means. PAM is also known as k-medoids, which is where the confusion may come from.

use

library(cluster)
cl = pam(dis, 4)

and see if you get what you want.

HTH,

Peter

On Mon, Apr 28, 2014 at 9:15 PM, cassie jones <cassiejone...@gmail.com> wrote:
> Dear R-users,
>
> I am trying to run kmeans on a set comprising of 100 observations.
> But R somehow can not figure out the true underlying groups, although other
> software such as Jmp, MINITAB are producing the desired result.
>
> Following is a brief example of what I am doing.
>
> library(stringdist)
> test=c('hematolgy','hemtology','oncology','onclogy',
> 'oncolgy','dermatolgy','dermatoloy','dematology',
> 'neurolog','nerology','neurolgy','nerology')
>
> dis=stringdistmatrix(test,test, method = "lv")
>
> set.seed(123)
> cl=kmeans(dis,4)
>
> grp_cl=vector('list',4)
>
> for(i in 1:4)
> {
> grp_cl[[i]]=test[which(cl$cluster==i)]
> }
> grp_cl
>
> [[1]]
> [1] "oncology" "onclogy"
>
> [[2]]
> [1] "neurolog" "nerology" "neurolgy" "nerology"
>
> [[3]]
> [1] "oncolgy"
>
> [[4]]
> [1] "hematolgy" "hemtology" "dermatolgy" "dermatoloy" "dematology"
>
> In the above example, the 'test' variable consists of a set of
> terminologies with various typos and I am trying to group the similar types
> of words based on their string distance. Unfortunately kmeans is not able
> to replicate the following result that the other software are able to
> produce.
>
> [[1]]
> [1] "oncology" "onclogy" "oncolgy"
>
> [[2]]
> [1] "neurolog" "nerology" "neurolgy" "nerology"
>
> [[3]]
> [1] "dermatolgy" "dermatoloy" "dematology"
>
> [[4]]
> [1] "hematolgy" "hemtology"
>
> Does anyone know if there is a way out, I have heard from a lot of people
> that multivariate analysis in R does not produce the desired result most of
> the time. Any help is really appreciated.
>
> Thanks in advance.
>
> Cassie
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
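P.S. The suggestion above about weighting the initial letter can be sketched roughly as follows. This is only an illustration under stated assumptions, not a recommended metric: it uses base R's adist() for the Levenshtein distances instead of stringdistmatrix(), the names first_letter, penalty and d_weighted are made up for the example, and the penalty of 2 is an arbitrary choice.

```r
library(cluster)  # provides pam(); a recommended package shipped with most R installations

test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

# Plain Levenshtein distances from base R (utils::adist), used here in place
# of stringdistmatrix(test, test, method = "lv").
d <- adist(test)

# Hypothetical adjustment: add a fixed penalty whenever two strings start
# with different letters. The weight is arbitrary and only for illustration.
first_letter <- substr(test, 1, 1)
penalty <- 2
d_weighted <- d + penalty * outer(first_letter, first_letter, "!=")

# Cluster on the dissimilarities. diss = TRUE tells pam() that the input is
# a set of distances, not a matrix of observations.
cl <- pam(as.dist(d_weighted), k = 4, diss = TRUE)
print(paste(test, "-", cl$clustering))
```

With this penalty, every distance between terms that start with different letters is at least 4, while distances among terms sharing a first letter stay at 2 or less, so dematology should no longer be pulled toward the hematology terms. A prefix-sensitive string metric would be a less ad hoc way to get a similar effect.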