Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao...@qq.com Wed, 26 Feb 2020 21:18:17 -0800

Hi,

> [...]
>> >>
>> >> Do you mean I should file a JIRA issue about reusing "centroidOf" and
>> >> "chooseInitialCenters",
>> >> then start a PR and a discussion about "ClusterUtils"?
>> >> And then start the PR for "MiniBatchKMeansClusterer" after all that is done?
>> >
>> >I cannot guarantee that the whole process will be streamlined.
>> >In effect, you can work on multiple branches (one for each
>> >prospective PR).
>> >I'd say that you should start by describing (here on the ML) the
>> >rationale for "ClusterUtils" (and contrast it with say, a common
>> >base class).
>> >[Only when the design has been agreed on, a JIRA issue to
>> >implement it should be created in order to track the actual
>> >coding work.]
>>
>> OK, I think we should start from here:
>>
>> The methods "centroidOf" and "chooseInitialCenters" in
>> KMeansPlusPlusClusterer could be reused by other k-means clusterers,
>> such as the MiniBatchKMeansClusterer that I want to implement.
>>
>> There are two solutions for reusing "centroidOf" and "chooseInitialCenters":
>> 1. Extract an abstract class for k-means clusterers named
>>    "AbstractKMeansClusterer", and move "centroidOf" and
>>    "chooseInitialCenters" into it as protected methods;
>>    the EmptyClusterStrategy and related logic can also move to
>>    "AbstractKMeansClusterer".
>> 2. Create a static utility class, and move "centroidOf" and
>>    "chooseInitialCenters" into it; other useful clustering methods, such
>>    as "predict" (predict which cluster is best for a specified point),
>>    can also be put there.
>>
>
>At first sight, I prefer option 1.
>Indeed, among other things, "chooseInitialCenters" is a method that is of no interest to
>users of the functionality (and so should not be part of the "public" API).


That is a persuasive explanation, and I agree with you that extracting an
abstract class for k-means is better.
How do we reach a conclusion on this?
---------------------------------------------

Speaking of the "public API", I suppose there should be a family of
"CentroidInitializer" implementations that "chooseInitialCenters" with
various algorithms.
The k-means++ clustering algorithm is a special case of k-means that
initializes the cluster centers with the k-means++ seeding algorithm.
So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be a
"KMeansClusterer" with a "KMeansPlusPlusCentroidInitializer" strategy.
When "KMeansClusterer" is initialized with a "RandomCentroidInitializer", it
is plain k-means.
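
To make the idea concrete, here is a rough sketch of what I have in mind.
Everything in it is an assumption for discussion only: the interface shape,
the use of plain double[] points instead of "Clusterable", and the explicit
java.util.Random argument are hypothetical, not existing Commons Math API.
It also assumes 1 <= k <= points.size().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical strategy: chooses k initial centers from the data set. */
interface CentroidInitializer {
    List<double[]> chooseInitialCenters(List<double[]> points, int k, Random rng);
}

/** Plain k-means seeding: k distinct points drawn uniformly at random. */
class RandomCentroidInitializer implements CentroidInitializer {
    @Override
    public List<double[]> chooseInitialCenters(List<double[]> points, int k, Random rng) {
        List<double[]> shuffled = new ArrayList<>(points);
        Collections.shuffle(shuffled, rng);
        return new ArrayList<>(shuffled.subList(0, k));
    }
}

/** k-means++ seeding: each new center is drawn with probability
 *  proportional to its squared distance to the nearest chosen center. */
class KMeansPlusPlusCentroidInitializer implements CentroidInitializer {
    @Override
    public List<double[]> chooseInitialCenters(List<double[]> points, int k, Random rng) {
        List<double[]> centers = new ArrayList<>();
        // The first center is drawn uniformly at random.
        centers.add(points.get(rng.nextInt(points.size())));

        // Squared distance from each point to its nearest chosen center.
        double[] minDistSq = new double[points.size()];
        Arrays.fill(minDistSq, Double.POSITIVE_INFINITY);

        while (centers.size() < k) {
            double[] last = centers.get(centers.size() - 1);
            double sum = 0;
            for (int i = 0; i < points.size(); i++) {
                minDistSq[i] = Math.min(minDistSq[i], distSq(points.get(i), last));
                sum += minDistSq[i];
            }
            // Roulette-wheel selection over the squared distances.
            double r = rng.nextDouble() * sum;
            int chosen = points.size() - 1; // fallback if all distances are zero
            double acc = 0;
            for (int i = 0; i < points.size(); i++) {
                acc += minDistSq[i];
                if (acc > r) {
                    chosen = i;
                    break;
                }
            }
            centers.add(points.get(chosen));
        }
        return centers;
    }

    private static double distSq(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }
}
```

With something like this, a single "KMeansClusterer" would only hold a
"CentroidInitializer" field, and the choice between plain k-means and
k-means++ becomes a constructor argument rather than a separate class.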

----------------------------------------------------------
>Method "centroidOf" looks generally useful.  Shouldn't it be part of
>the "Cluster"
>interface?  What is the difference with method "getCenter" (defined by class
>"CentroidCluster")? 

My understanding is:
 * "Cluster" is a data class that carries the result of a clustering;
   "getCenter" is just a getter of "CentroidCluster" that returns the
   center point.
 * "Cluster[er]" is an algorithm (interface) that partitions points into a
   set of Clusters.
 * "CentroidCluster" is the result of a particular group of Clusterer
   algorithms, such as k-means;
   "centroidOf" is the specific logic that calculates the center point of a
   collection of points.
[In contrast, the DBSCAN clustering algorithm does not care about the "Centroid".]

So "centroidOf" may be a method of a "CentroidCluster[er]" (which does not
exist yet), but it is different from "CentroidCluster.getCenter".
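
A small sketch of the distinction, under my own assumptions: the class names
"CentroidClusterSketch" and "CentroidClustererSketch" are hypothetical, and
points are plain double[] rather than "Clusterable". The result class only
returns a stored value, while the algorithm side computes the mean.

```java
import java.util.Collection;
import java.util.List;

/** Result side: stores the points and a center that was computed elsewhere. */
class CentroidClusterSketch {
    private final double[] center;
    private final List<double[]> points;

    CentroidClusterSketch(double[] center, List<double[]> points) {
        this.center = center;
        this.points = points;
    }

    /** Plain getter: no computation happens here. */
    double[] getCenter() {
        return center;
    }

    List<double[]> getPoints() {
        return points;
    }
}

/** Algorithm side: knows how a center is derived from a set of points. */
class CentroidClustererSketch {
    /** Component-wise arithmetic mean of the given points. */
    static double[] centroidOf(Collection<double[]> points, int dimension) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("cannot compute the centroid of no points");
        }
        double[] centroid = new double[dimension];
        for (double[] p : points) {
            for (int i = 0; i < dimension; i++) {
                centroid[i] += p[i];
            }
        }
        for (int i = 0; i < dimension; i++) {
            centroid[i] /= points.size();
        }
        return centroid;
    }
}
```

A k-means style clusterer would call "centroidOf" at each iteration and hand
the result to the cluster it builds; a DBSCAN style clusterer would never
need it.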

>
>Regards,
>Gilles