Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao...@qq.com Wed, 04 Mar 2020 19:51:26 -0800

Hi,

>Hi.
>
>Le ven. 28 févr. 2020 à 05:04, chentao...@qq.com <chentao...@qq.com> a écrit :
>>
>>
>>
>>
>> --------------
>> chentao...@qq.com
>> >Hi.
>> >
>> >Le jeu. 27 févr. 2020 à 06:17, chentao...@qq.com <chentao...@qq.com> a 
>> >écrit :
>> >>
>> >> Hi,
>> >>
>> >> > [...]
>> >> >> >>
>> >> >> >> Do you mean I should fire a JIRA issue about 
>> >> >> >> reuse&nbsp;"centroidOf" and "chooseInitialCenters",
>> >> >> >> then start a PR and a disscuss about "ClusterUtils"?
>> >> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all 
>> >> >> >> done?
>> >> >> >
>> >> >> >I cannot guarantee that the whole process will be streamlined.
>> >> >> >In effect, you can work on multiple branches (one for each
>> >> >> >prospective PR).
>> >> >> >I'd say that you should start by describing (here on the ML) the
>> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>> >> >> >base class).
>> >> >> >[Only when the design has been agreed on,  a JIRA issue to
>> >> >> >implement it should be created in order to track the actual
>> >> >> >coding work).]
>> >> >>
>> >> >> OK, I think we should start from here:
>> >> >>
>> >> >> The method "centroidOf"  and "chooseInitialCenters" in 
>> >> >> KMeansPlusPlusClusterer
>> >> >>  could be reused by other KMeans Clusterer like 
>> >> >>MiniBatchKMeansClusterer which I want to implement.
>> >> >>
>> >> >> There are two solution for reuse "centroidOf"  and 
>> >> >> "chooseInitialCenters":
>> >> >> 1. Extract a abstract class for KMeans Clusterer named 
>> >> >> "AbstractKMeansClusterer",
>> >> >>  and move "centroidOf"  and "chooseInitialCenters" as protected 
>> >> >>methods in it;
>> >> >>  the EmptyClusterStrategy and related logic can also move to the 
>> >> >>"AbstractKMeansClusterer".
>> >> >> 2. Create a static utility class, and move "centroidOf"  and 
>> >> >> "chooseInitialCenters" in it,
>> >> >>  and some useful clustering method like predict(Predict which cluster 
>> >> >>is best for a specified point) can put in it.
>> >> >>
>> >> >
>> >> >At first sight, I prefer option 1.
>> >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no 
>> >> >interest to
>> >> >users of the functionality (and so should not be part of the "public" 
>> >> >API).
>> >>
>> >> Persuasive explain, and I agree with you, that extract a abstract class 
>> >> for KMeans is better.
>> >> And how can we make a conclusion?
>> >> ---------------------------------------------
>> >>
>> >> Mention the "public API", I suppose there should be a series of 
>> >> "CentroidInitializer",
>> >>  that "chooseInitialCenters" with various of algorithms.
>> >> The k-means++ cluster algorithm is a special implementation of k-means
>> >>  which initialize cluster centers with k-means++ algorithm.
>> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be 
>> >> "KMeansClusterer"
>> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
>> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it 
>> >> is a common k-means.
>> >>
>> >> ----------------------------------------------------------
>> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>> >> >the "Cluster"
>> >> >interface?  What is the difference with method "getCenter" (define by 
>> >> >class
>> >> >"CentroidCluster")?
>> >>
>> >> My understanding is,:
>> >>  * "Cluster" is a data class that carry the result of a clustering,
>> >> "getCenter" is just a get method of CentroidCluster for get the value of 
>> >> a center point.
>> >>  * "Cluster[er]" is a (Interface of )algorithm that classify points to 
>> >>sets of Cluster.
>> >>  * "CentroidCluster" is the result of a group of special Clusterer 
>> >>algorithm like k-means,
>> >>  "centroidOf" is a specific logic to calculate the center point for a 
>> >>collection of points.
>> >> [Instead the DBScan cluster algorithm dose not care about the "Centroid"]
>> >>
>> >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists yet),
>> >>  but different with "CentroidCluster.getCenter".
>> >
>> >I may be missing something about the existing design,
>> >but it seems strange that "CentroidCluster" is initialized
>> >with a given "center", yet it is possible to add points after
>> >initialization (which IIUC would invalidate the "center").
>>
>> The "centroidOf" could be part of "CentroidCluster",
>> but I think the existsing desgin was focus on decouple of 
>> "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster".
>
>I don't see why we need both "Cluster" and "CentroidCluster".
>Indeed, as suggested before, the "center" can be computed
>from a "Cluster", but does not need to be stored in it.


Typical usecase for a List<CentroidCluster> is, when we need classify a new 
point, 
calculate the distance of the new point to each CentroidCluster.center is the 
simplest,
and the center should be cached.

>
>>
>> Center recalculate often happens in each iteration of k-means Clustering,
>> always with points reassign to clusters.
>> We often use k-means as two pharse:
>> Pharse 1: Training, classify thousands of points to set of clusters.
>> Pharse 2: Predict, predict which cluster is best for a new point,
>> or add a new point to the best cluster in ClusterSet,
>
>Method "cluster" returns a "List<Cluster>"; there is no need for a
>new "ClusterSet" class.
>Also, IIUC the centers can be collected into a "List<Clusterable>",
>so that the association is through the index into the list(s).
>
>> but we never update the cluster center until next retraining.
>
>IMO, that's the reason for *not*" storing the center (in such a
>mutable instance).
>
>>
>> The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math" 
>> just design for pharse "Training",
>> it is clearly if we can consider "CentroidCluster" as a pure data class just 
>> for k-means clustering result.
>
>See above.
>
>Discussing the existing design further, I think that the "cluster" method 
>should
>rather be:
>---CUT---
>public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure dist)
>---CUT---
>
>And, similarly,
>---CUT---
>@FunctionalInterface
>public interface ClusterFinder<T extends Clusterable> {
>    public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure 
>dist);
>}
>---CUT--- 

I think there is a balance between flexibility, efficiency and usability.
The DistanceMeasure should be assigned once for a kmeans, and can not change in 
exists clusters.
So DistanceMeasure should be a property of Cluster, but redundant for a List of 
Cluster.

Consider the Cluster fewly use separately, the DistanceMeasure should be a 
readonly property of List<Cluster>,
If there is a ClusterSet, so we can serialize and unserialize the Clusters 
correctly and with no redundancy.

As a comparison, dl4j's design is easy to use for predict, especial the 
ClusterSet
but leak of  flexibility(stroke coupling to nd4j) and efficiency(complexity on 
nd4j implemention):
https://github.com/eclipse/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering

>
>> If we want the cluster result useful enough for parse "Predict",
>>  the result of "KMeansPlusPlusClusterer.cluster" should return a  
>>"ClusterSet":
>> ```java
>> public interface ClusterSet<T extends Clusterable> extends Collection<T> {
>>   // Retrun the cluster which the point should belong to.
>>   Cluster predict(T point);
>>   // Add a point to best cluster.
>>   void addPoint(T point);
>> }
>> ```
>
>This  "ClusterSet" seems less flexible than a "List<Cluster>". 

"ClusterSet" is useful for predict as I mentioned above.

>
>> And "centroidOf"(just used in clustering iteration) can move up into a 
>> abstract class like "CenroidClusterer".
>
>It seems that this method could be useful for users too.
> 

May be "centroidOf" is useful, but now it is just used in kmeans.

>Best,
>Gilles
>
>> >It would seem that "center" should be a property computed
>> >from the contents of "Cluster" e.g.:
>> >
>> >@FunctionalInterface
>> >public interface ClusterCenterComputer<T extends Clusterable> {
>> >    T centroidOf(Cluster<T> cluster);
>> >}
>> >
>> >Regards,
>> >Gilles
>> >
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>For additional commands, e-mail: dev-h...@commons.apache.org
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Reply via email to