[ https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541516#comment-14541516 ]

Peter Schrott edited comment on FLINK-1731 at 5/13/15 7:33 AM:
---------------------------------------------------------------

Hi [~chiwanpark],

the thing is that, to fit the model, KMeans uses two datasets: the training data and the initial centroids. The initial centroids are used to create the appropriate clusters on the training dataset, and these clusters define the fitted model. This means the {{fit}} method needs two inputs at that point, which is why I suggested using the parameter map to pass the initial centroids. The training dataset will be passed as the argument to the {{fit}} method, as in the CoCoA implementation. The test dataset will be applied to the trained model afterwards.


was (Author: peedeex21):
Hi [~chiwanpark],

the thing is that, to fit the model, KMeans uses two datasets: the training data and the initial centroids. This means the {{fit}} method needs two inputs at that point, which is why I suggested using the parameter map to pass the initial centroids. The training dataset will be passed as the argument to the {{fit}} method, as in the CoCoA implementation.

> Add kMeans clustering algorithm to machine learning library
> -----------------------------------------------------------
>
>                 Key: FLINK-1731
>                 URL: https://issues.apache.org/jira/browse/FLINK-1731
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Alexander Alexandrov
>              Labels: ML
>
> The Flink repository already contains a kMeans implementation but it is not
> yet ported to the machine learning library. I assume that only the used data
> types have to be adapted and then it can be more or less directly moved to
> flink-ml.
> The kMeans++ [1] and the kMeans|| [2] algorithms constitute a better
> implementation because they improve the initial seeding phase to achieve near
> optimal clustering.
> It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
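
For illustration, the proposal in the comment (training data as the argument to {{fit}}, initial centroids passed through the parameter map) can be sketched roughly as follows. This is plain Python, not the flink-ml Scala API; the names `fit`, `predict`, `params` and the key `initial_centroids` are hypothetical, and the loop is the textbook Lloyd iteration rather than the existing Flink implementation.

```python
from math import dist  # Euclidean distance, Python 3.8+


def fit(training, params, max_iter=100):
    """Hypothetical fit: training data is the argument, while the
    initial centroids come from the parameter map (a dict here)."""
    centroids = [tuple(c) for c in params["initial_centroids"]]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in training:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # A centroid with an empty cluster is left where it is.
        updated = [
            tuple(sum(xs) / len(pts) for xs in zip(*pts)) if pts else c
            for pts, c in zip(clusters, centroids)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids  # the fitted model is the list of final centroids


def predict(model, points):
    """Apply the trained model: index of the nearest centroid per point."""
    return [min(range(len(model)), key=lambda i: dist(p, model[i])) for p in points]


training = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
params = {"initial_centroids": [(1.0, 1.0), (9.0, 1.0)]}  # assumed key name

model = fit(training, params)
labels = predict(model, [(1.0, 1.0), (8.0, 0.0)])  # test data, applied afterwards
```

The point of the sketch is only the calling convention: {{fit}} keeps a single dataset argument, and the second dataset travels alongside the other parameters.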