Dear R-users,

I am trying to run kmeans on a set of 100 observations, but R
somehow cannot recover the true underlying groups, although other
software such as JMP and Minitab produces the desired result.

The following is a brief example of what I am doing.

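library(stringdist)

# Misspelled terms to be grouped
test <- c('hematolgy','hemtology','oncology','onclogy',
          'oncolgy','dermatolgy','dermatoloy','dematology',
          'neurolog','nerology','neurolgy','nerology')

# Pairwise Levenshtein distances (12 x 12 matrix)
dis <- stringdistmatrix(test, test, method = "lv")

# Note: kmeans treats each row of dis as a point with 12 coordinates
set.seed(123)
cl <- kmeans(dis, 4)

# Collect the words assigned to each of the 4 clusters
grp_cl <- vector('list', 4)
for (i in 1:4) {
    grp_cl[[i]] <- test[which(cl$cluster == i)]
}
grp_cl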

[[1]]
[1] "oncology" "onclogy"

[[2]]
[1] "neurolog" "nerology" "neurolgy" "nerology"

[[3]]
[1] "oncolgy"

[[4]]
[1] "hematolgy"  "hemtology"  "dermatolgy" "dermatoloy" "dematology"

In the above example, the 'test' variable holds a set of terms with
various typos, and I am trying to group similar words based on their
string distance. Unfortunately, kmeans does not reproduce the following
result, which the other software packages produce:
[[1]]
[1] "oncology" "onclogy"  "oncolgy"

[[2]]
[1] "neurolog" "nerology" "neurolgy" "nerology"

[[3]]
[1] "dermatolgy" "dermatoloy" "dematology"

[[4]]
[1] "hematolgy"  "hemtology"


Does anyone know a way around this? I have heard from a lot of people
that multivariate analysis in R does not produce the desired result most of
the time. Any help is much appreciated.


Thanks in advance.


Cassie


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
