Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

[email protected] Thu, 27 Feb 2020 20:04:17 -0800

--------------
[email protected]
>Hi.
>
>On Thu, 27 Feb 2020 at 06:17, [email protected] <[email protected]> wrote:
>>
>> Hi,
>>
>> > [...]
>> >> >>
>> >> >> Do you mean I should file a JIRA issue about reusing "centroidOf"
>> >> >> and "chooseInitialCenters",
>> >> >> then start a PR and a discussion about "ClusterUtils"?
>> >> >> And then start the PR for "MiniBatchKMeansClusterer" after all that is
>> >> >> done?
>> >> >
>> >> >I cannot guarantee that the whole process will be streamlined.
>> >> >In effect, you can work on multiple branches (one for each
>> >> >prospective PR).
>> >> >I'd say that you should start by describing (here on the ML) the
>> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>> >> >base class).
>> >> >[Only when the design has been agreed on should a JIRA issue to
>> >> >implement it be created, in order to track the actual
>> >> >coding work.]
>> >>
>> >> OK, I think we should start from here:
>> >>
>> >> The methods "centroidOf" and "chooseInitialCenters" in
>> >> KMeansPlusPlusClusterer
>> >> could be reused by other k-means clusterers, such as the
>> >> MiniBatchKMeansClusterer that I want to implement.
>> >>
>> >> There are two ways to reuse "centroidOf" and "chooseInitialCenters":
>> >> 1. Extract an abstract class for k-means clusterers, named
>> >>    "AbstractKMeansClusterer",
>> >>    and move "centroidOf" and "chooseInitialCenters" into it as protected
>> >>    methods;
>> >>    the EmptyClusterStrategy and related logic can also move to
>> >>    "AbstractKMeansClusterer".
>> >> 2. Create a static utility class and move "centroidOf" and
>> >>    "chooseInitialCenters" into it;
>> >>    other useful clustering methods, such as predict (which picks the
>> >>    best cluster for a specified point), can be put there too.
>> >>
>> >
>> >At first sight, I prefer option 1.
>> >Indeed, among other things, "chooseInitialCenters" is a method that is of
>> >no interest to
>> >users of the functionality (and so should not be part of the "public" API).
>>
>> That is a persuasive explanation, and I agree with you that extracting an
>> abstract class for k-means is better.
>> How do we reach a conclusion?
>> ---------------------------------------------
>>
>> Speaking of the "public API", I suppose there could be a family of
>> "CentroidInitializer" implementations,
>> each providing "chooseInitialCenters" with a different algorithm.
>> The k-means++ clustering algorithm is a special case of k-means
>> that initializes the cluster centers with the k-means++ seeding algorithm.
>> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be a
>> "KMeansClusterer"
>> with a "KMeansPlusPlusCentroidInitializer" strategy.
>> When "KMeansClusterer" is initialized with a "RandomCentroidInitializer",
>> it is plain k-means.
>>
>> ----------------------------------------------------------
>> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>> >the "Cluster"
>> >interface?  What is the difference with method "getCenter" (defined by class
>> >"CentroidCluster")?
>>
>> My understanding is:
>>  * "Cluster" is a data class that carries the result of a clustering;
>>    "getCenter" is just a getter of CentroidCluster that returns the
>>    center point.
>>  * "Cluster[er]" is an algorithm (interface) that partitions points into a
>>    set of Clusters.
>>  * "CentroidCluster" is the result of a particular group of Clusterer
>>    algorithms, such as k-means;
>>    "centroidOf" is the specific logic that calculates the center point of a
>>    collection of points.
>> [In contrast, the DBSCAN clustering algorithm does not care about the "centroid".]
>>
>> So "centroidOf" may be a method of "CentroidCluster[er]" (which does not exist yet),
>> but it is different from "CentroidCluster.getCenter".
>
>I may be missing something about the existing design,
>but it seems strange that "CentroidCluster" is initialized
>with a given "center", yet it is possible to add points after
>initialization (which IIUC would invalidate the "center"). 

"centroidOf" could be part of "CentroidCluster",
but I think the existing design was focused on decoupling
"DistanceMeasure" ("centroidOf" depends on it) from "CentroidCluster".

The center is usually recalculated in each iteration of k-means clustering,
always together with the reassignment of points to clusters.
We often use k-means in two phases:
Phase 1: Training, classify thousands of points into a set of clusters.
Phase 2: Predict, predict which cluster is best for a new point,
or add a new point to the best cluster in a ClusterSet,
but never update the cluster centers until the next retraining.
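
To make the "Predict" phase concrete, here is a minimal sketch. It is only
an illustration: the class "FrozenCenterClusterSet" and its methods are
hypothetical names of mine, it uses plain double[] points and squared
Euclidean distance instead of the commons-math "Clusterable" and
"DistanceMeasure" types, and it assumes the centers come from an earlier
training run.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of commons-math. */
final class FrozenCenterClusterSet {
    /** Centers produced by training; never modified afterwards. */
    private final double[][] centers;
    /** Points assigned to each cluster after training. */
    private final List<List<double[]>> members = new ArrayList<>();

    FrozenCenterClusterSet(double[][] trainedCenters) {
        this.centers = trainedCenters;
        for (int i = 0; i < trainedCenters.length; i++) {
            members.add(new ArrayList<>());
        }
    }

    /** Returns the index of the nearest center (squared Euclidean distance). */
    int predict(double[] point) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = 0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centers[i][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    /** Adds the point to its best cluster; the center is deliberately not recomputed. */
    int addPoint(double[] point) {
        int index = predict(point);
        members.get(index).add(point);
        return index;
    }

    int size(int clusterIndex) {
        return members.get(clusterIndex).size();
    }

    double[] center(int clusterIndex) {
        return centers[clusterIndex].clone();
    }
}
```

The point of the sketch is that "addPoint" changes the membership of a
cluster but leaves its center untouched until the next retraining.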

The KMeansPlusPlusClusterer and the other clustering algorithms in "commons-math"
are designed only for the "Training" phase;
things are clearer if we consider "CentroidCluster" a pure data class that just
holds the k-means clustering result.

If we want the clustering result to be useful for the "Predict" phase,
"KMeansPlusPlusClusterer.cluster" should return a "ClusterSet":
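```java
import java.util.Collection;

// Proposed (does not exist yet); Cluster and Clusterable are the commons-math types.
public interface ClusterSet<T extends Clusterable> extends Collection<T> {
  // Return the cluster to which the point should belong.
  Cluster<T> predict(T point);
  // Add a point to the best cluster (centers are not updated).
  void addPoint(T point);
}
```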
And "centroidOf" (only used in the clustering iterations) can move up into an
abstract class like "CentroidClusterer".
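
A rough sketch of what I mean follows. The class name
"CentroidClustererSketch" is hypothetical, and I assume here that the
centroid is the component-wise mean of the points, represented as plain
double[] arrays rather than "Clusterable" instances.

```java
import java.util.List;

/** Hypothetical base class for centroid-based clusterers; not existing API. */
abstract class CentroidClustererSketch {
    /**
     * Component-wise mean of the points.
     * Assumes a non-empty list whose arrays all have length >= dimension.
     */
    protected static double[] centroidOf(List<double[]> points, int dimension) {
        double[] centroid = new double[dimension];
        for (double[] p : points) {
            for (int i = 0; i < dimension; i++) {
                centroid[i] += p[i];
            }
        }
        for (int i = 0; i < dimension; i++) {
            centroid[i] /= points.size();
        }
        return centroid;
    }
}
```

Subclasses such as a mini-batch k-means could then call "centroidOf" in
their iteration loop without exposing it in the public API.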

>It would seem that "center" should be a property computed
>from the contents of "Cluster" e.g.:
>
>@FunctionalInterface
>public interface ClusterCenterComputer<T extends Clusterable> {
>    T centroidOf(Cluster<T> cluster);
>}
>
>Regards,
>Gilles
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [email protected]
>For additional commands, e-mail: [email protected]
>
>