Re: [R] K-means recluster data with given cluster centers

Christian Hennig Mon, 11 Jan 2010 04:47:19 -0800

That kmeans returns an error if there is an empty cluster is a bit of a nuisance.

It should not be too difficult to get rid of the kmeans function for what you call "reclustering". You could write your own function that assigns every point of the new data to the closest initial center. That should be relatively easy and does the same thing, if I understand correctly what you want.
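
A minimal sketch of such a nearest-center assignment, assuming squared Euclidean distance (the distance kmeans itself uses). The function name assign_to_centers and the toy data are hypothetical and not part of any package; empty clusters simply get size 0 instead of raising an error.

```r
# Assign each row of `newdata` to the nearest row of `centers`
# (squared Euclidean distance). Empty clusters are allowed.
# NOTE: assign_to_centers is a hypothetical helper, not a library function.
assign_to_centers <- function(newdata, centers) {
  newdata <- as.matrix(newdata)
  centers <- as.matrix(centers)
  stopifnot(ncol(newdata) == ncol(centers))

  # n x k matrix of squared distances, one column per center
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(newdata, 2, centers[j, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(newdata))  # keep matrix shape if n == 1

  # index of the smallest distance in each row
  cluster <- max.col(-d2, ties.method = "first")
  list(cluster = cluster,
       size = tabulate(cluster, nbins = nrow(centers)))
}

# Toy example: the third center attracts no points
centers <- rbind(c(0, 0), c(10, 10), c(100, 100))
newdata <- rbind(c(1, 1), c(9, 11), c(0, -1))
res <- assign_to_centers(newdata, centers)
res$cluster  # 1 2 1
res$size     # 2 1 0
```

In the reclustering loop, a call like this would stand in for the kmeans(..., iter.max=1) step, with clusterCenters loaded from the saved file. Unlike kmeans, it never moves the centers.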

I won't comment on whether what you are attempting makes sense, which depends entirely on the aim of your analysis (and on what you mean by "cluster in the same way"), but an alternative could be to cluster the initial data with mclustBIC in the mclust package and to use the resulting clusters as training data in mclustDA.
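
A rough sketch of that mclust route, under several assumptions: the mclust package is installed; a current release is used, where (as far as I know) the discriminant-analysis function is spelled MclustDA and Mclust runs the mclustBIC model selection internally; and the simulated two-group data and G = 2 are placeholders for illustration only.

```r
library(mclust)  # assumed to be installed

set.seed(1)
# Placeholder data: two well-separated groups in two dimensions
old_data <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
                  matrix(rnorm(200, mean = 5), ncol = 2))
new_data <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                  matrix(rnorm(20, mean = 5), ncol = 2))

# Step 1: model-based clustering of the initial data (BIC picks the model)
fit <- Mclust(old_data, G = 2)

# Step 2: use the resulting clusters as training labels
da <- MclustDA(old_data, class = fit$classification, modelType = "EDDA")

# Step 3: classify new data with the trained model
pred <- predict(da, newdata = new_data)
table(pred$classification)
```

Classes that receive no new points simply show a count of zero in the table.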


Cheers,
Christian


On Mon, 11 Jan 2010, t.peter.muel...@gmx.net wrote:

K-means recluster data with given cluster centers

Dear R user,

I have several large data sets. Over time additional new data sets will be 
created.
I want to cluster all the data in a similar/ identical way with the k-means 
algorithm.

With the first data set I will find my cluster centers and save the cluster 
centers to a file [1].
This first data set is huge, so it is guaranteed that the cluster centers will 
converge.

Afterwards I load my cluster centers and cluster via k-means all other datasets 
with the same cluster centers [2].

I tried this, but in the reclustering step I now get the following error 
message:
"Error: empty cluster: try a better set of initial centers"

That one of the clusters is empty (has no data point) should not be a problem. This can happen because the new data sets can be smaller. What am I doing wrong? Is there another way to cluster new data in the same way as the old data sets?


Thanks
Peter


1: R code to find cluster center and save them to file
  #---INITIAL CLUSTERING TO FIND CLUSTER CENTERS
  # LOAD LIB
  library(cluster)

  # LOAD DATA
  data_unclean <- read.table("dataset1.dat")
  data.matrix<-as.matrix(data_unclean,"any")

  # CLUSTER
  Nclust <- 100 # number of cluster centers
  Imax <- 200 # maximum number of iterations for convergence of clustering
  set.seed(100) # set seed of random number generator
  # pick Nclust random rows as the initial prototypes
  init <- sample(dim(data.matrix)[1], Nclust)
  km <- kmeans(data.matrix, centers=data.matrix[init,], iter.max=Imax)

  # WRITE OUT CLUSTER CENTERS
  km$centers # print cluster center (columns: dim component; rows: clusters)
  km$size # print amount of data in each cluster
  clusterCenters=km$centers
  save(file="clusterCenters.RData", list='clusterCenters') # example
  write.table(km$centers, file = "clusterCenters.dat", sep = ",",
              col.names = FALSE, row.names = FALSE)


2: R code to recluster new data
  #---RECLUSTER NEW DATA WITH GIVEN CLUSTER CENTERS
  # LOAD LIB, SET PARAMETER
  library(cluster)
  loopStart <- 0 # numeric, so that loopStart:loopEnd needs no string coercion
  loopEnd <- 10

  # LOAD CLUSTER CENTER
  load("clusterCenters.RData") # load cluster centers

  # LOOP OVER TRAJ AND RECLUSTER THEM
  for(ii in loopStart:loopEnd){
       # DEFINE FILENAME
       #print(paste("test",ii,sep=""))
       # "." added so the name matches the dataset1.dat pattern used in [1]
       filenameInput=paste("dataset",ii,".dat",sep="")
       filenameOutput=paste("dataset",ii,"datClusters",sep="")
       print(filenameInput)
       print(filenameOutput)

       # LOAD DATA
       data_unclean <- read.table(filenameInput)
       data.matrix<-as.matrix(data_unclean,"any")

       # RECLUSTER DATA
       kmRecluster <- kmeans(data.matrix, centers=clusterCenters, iter.max=1)
       kmRecluster$size

       # WRITE OUT CLUSTERS FOR EACH DATA
       write.table(kmRecluster$cluster, file = filenameOutput, sep = ",",
                   col.names = FALSE, row.names = FALSE)
  }


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

