[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541516#comment-14541516
 ] 

Peter Schrott edited comment on FLINK-1731 at 5/13/15 7:33 AM:
---------------------------------------------------------------

Hi [~chiwanpark],

the thing is that, to fit the model, KMeans needs two datasets: the 
training data and the initial centroids. The initial centroids are 
used to form the appropriate clusters on the training dataset. These 
clusters define the fitted model.

This means the {{fit}} method needs two inputs at that point. This 
is why I suggested using the parameter map to pass the initial 
centroids. The training dataset will be passed as an argument to the 
{{fit}} method, as in the CoCoA implementation.

The test dataset will be applied to the trained model afterwards.
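
A minimal sketch of that calling convention, written in plain Python rather 
than against the actual flink-ml API. The class name, the INITIAL_CENTROIDS 
key and the dict standing in for the parameter map are all assumptions made 
for illustration; only the shape of the calls (training data as the argument 
to fit, initial centroids via the parameters, test data applied afterwards) 
follows the description above.

```python
import math

# Hypothetical parameter key; a plain dict stands in for the parameter map.
INITIAL_CENTROIDS = "initial_centroids"


def _nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


class KMeans:
    def __init__(self, num_iterations=10):
        self.num_iterations = num_iterations
        self.centroids = None  # set by fit(); this is the fitted model

    def fit(self, training, parameters):
        """Fit on `training`; the initial centroids come from `parameters`."""
        centroids = [tuple(c) for c in parameters[INITIAL_CENTROIDS]]
        for _ in range(self.num_iterations):
            # Assignment step: group the training points by nearest centroid.
            clusters = [[] for _ in centroids]
            for p in training:
                clusters[_nearest(p, centroids)].append(p)
            # Update step: move each centroid to the mean of its cluster;
            # a centroid with no points keeps its previous position.
            centroids = [
                tuple(sum(xs) / len(members) for xs in zip(*members)) if members else c
                for members, c in zip(clusters, centroids)
            ]
        self.centroids = centroids
        return self

    def predict(self, points):
        """Label each point with the index of its nearest fitted centroid."""
        return [_nearest(p, self.centroids) for p in points]


# Training data is the argument to fit; initial centroids go in the parameters.
training = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
model = KMeans(num_iterations=5).fit(training, {INITIAL_CENTROIDS: [(1.0, 1.0), (9.0, 9.0)]})

# The test dataset is applied to the trained model afterwards.
labels = model.predict([(0.5, 0.5), (9.5, 10.0)])
```

Keeping the initial centroids in the parameters leaves fit with a single 
data argument, so its signature would stay the same as for learners that 
need only the training data.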


> Add kMeans clustering algorithm to machine learning library
> -----------------------------------------------------------
>
>                 Key: FLINK-1731
>                 URL: https://issues.apache.org/jira/browse/FLINK-1731
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Alexander Alexandrov
>              Labels: ML
>
> The Flink repository already contains a kMeans implementation, but it has not 
> yet been ported to the machine learning library. I assume that only the data 
> types used have to be adapted and then it can be moved more or less directly to 
> flink-ml.
> The kMeans++ [1] and kMeans|| [2] algorithms are better 
> alternatives because they improve the initial seeding phase to achieve 
> near-optimal clustering. It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)