Hi, >Hi. > >Le ven. 28 févr. 2020 à 05:04, chentao...@qq.com <chentao...@qq.com> a écrit : >> >> >> >> >> -------------- >> chentao...@qq.com >> >Hi. >> > >> >Le jeu. 27 févr. 2020 à 06:17, chentao...@qq.com <chentao...@qq.com> a >> >écrit : >> >> >> >> Hi, >> >> >> >> > [...] >> >> >> >> >> >> >> >> Do you mean I should fire a JIRA issue about >> >> >> >> reuse "centroidOf" and "chooseInitialCenters", >> >> >> >> then start a PR and a disscuss about "ClusterUtils"? >> >> >> >> And then start the PR of "MiniBatchKMeansClusterer" after all >> >> >> >> done? >> >> >> > >> >> >> >I cannot guarantee that the whole process will be streamlined. >> >> >> >In effect, you can work on multiple branches (one for each >> >> >> >prospective PR). >> >> >> >I'd say that you should start by describing (here on the ML) the >> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common >> >> >> >base class). >> >> >> >[Only when the design has been agreed on, a JIRA issue to >> >> >> >implement it should be created in order to track the actual >> >> >> >coding work).] >> >> >> >> >> >> OK, I think we should start from here: >> >> >> >> >> >> The method "centroidOf" and "chooseInitialCenters" in >> >> >> KMeansPlusPlusClusterer >> >> >> could be reused by other KMeans Clusterer like >> >> >>MiniBatchKMeansClusterer which I want to implement. >> >> >> >> >> >> There are two solution for reuse "centroidOf" and >> >> >> "chooseInitialCenters": >> >> >> 1. Extract a abstract class for KMeans Clusterer named >> >> >> "AbstractKMeansClusterer", >> >> >> and move "centroidOf" and "chooseInitialCenters" as protected >> >> >>methods in it; >> >> >> the EmptyClusterStrategy and related logic can also move to the >> >> >>"AbstractKMeansClusterer". >> >> >> 2. Create a static utility class, and move "centroidOf" and >> >> >> "chooseInitialCenters" in it, >> >> >> and some useful clustering method like predict(Predict which cluster >> >> >>is best for a specified point) can put in it. >> >> >> >> >> > >> >> >At first sight, I prefer option 1. >> >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no >> >> >interest to >> >> >users of the functionality (and so should not be part of the "public" >> >> >API). >> >> >> >> Persuasive explain, and I agree with you, that extract a abstract class >> >> for KMeans is better. >> >> And how can we make a conclusion? >> >> --------------------------------------------- >> >> >> >> Mention the "public API", I suppose there should be a series of >> >> "CentroidInitializer", >> >> that "chooseInitialCenters" with various of algorithms. >> >> The k-means++ cluster algorithm is a special implementation of k-means >> >> which initialize cluster centers with k-means++ algorithm. >> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be >> >> "KMeansClusterer" >> >> with a "KMeansPlusPlusCentroidInitializer" strategy. >> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it >> >> is a common k-means. >> >> >> >> ---------------------------------------------------------- >> >> >Method "centroidOf" looks generally useful. Shouldn't it be part of >> >> >the "Cluster" >> >> >interface? What is the difference with method "getCenter" (define by >> >> >class >> >> >"CentroidCluster")? >> >> >> >> My understanding is,: >> >> * "Cluster" is a data class that carry the result of a clustering, >> >> "getCenter" is just a get method of CentroidCluster for get the value of >> >> a center point. >> >> * "Cluster[er]" is a (Interface of )algorithm that classify points to >> >>sets of Cluster. >> >> * "CentroidCluster" is the result of a group of special Clusterer >> >>algorithm like k-means, >> >> "centroidOf" is a specific logic to calculate the center point for a >> >>collection of points. >> >> [Instead the DBScan cluster algorithm dose not care about the "Centroid"] >> >> >> >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists yet), >> >> but different with "CentroidCluster.getCenter". >> > >> >I may be missing something about the existing design, >> >but it seems strange that "CentroidCluster" is initialized >> >with a given "center", yet it is possible to add points after >> >initialization (which IIUC would invalidate the "center"). >> >> The "centroidOf" could be part of "CentroidCluster", >> but I think the existsing desgin was focus on decouple of >> "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster". > >I don't see why we need both "Cluster" and "CentroidCluster". >Indeed, as suggested before, the "center" can be computed >from a "Cluster", but does not need to be stored in it.
Typical usecase for a List<CentroidCluster> is, when we need classify a new point, calculate the distance of the new point to each CentroidCluster.center is the simplest, and the center should be cached. > >> >> Center recalculate often happens in each iteration of k-means Clustering, >> always with points reassign to clusters. >> We often use k-means as two pharse: >> Pharse 1: Training, classify thousands of points to set of clusters. >> Pharse 2: Predict, predict which cluster is best for a new point, >> or add a new point to the best cluster in ClusterSet, > >Method "cluster" returns a "List<Cluster>"; there is no need for a >new "ClusterSet" class. >Also, IIUC the centers can be collected into a "List<Clusterable>", >so that the association is through the index into the list(s). > >> but we never update the cluster center until next retraining. > >IMO, that's the reason for *not*" storing the center (in such a >mutable instance). > >> >> The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math" >> just design for pharse "Training", >> it is clearly if we can consider "CentroidCluster" as a pure data class just >> for k-means clustering result. > >See above. > >Discussing the existing design further, I think that the "cluster" method >should >rather be: >---CUT--- >public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure dist) >---CUT--- > >And, similarly, >---CUT--- >@FunctionalInterface >public interface ClusterFinder<T extends Clusterable> { > public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure >dist); >} >---CUT--- I think there is a balance between flexibility, efficiency and usability. The DistanceMeasure should be assigned once for a kmeans, and can not change in exists clusters. So DistanceMeasure should be a property of Cluster, but redundant for a List of Cluster. Consider the Cluster fewly use separately, the DistanceMeasure should be a readonly property of List<Cluster>, If there is a ClusterSet, so we can serialize and unserialize the Clusters correctly and with no redundancy. As a comparison, dl4j's design is easy to use for predict, especial the ClusterSet but leak of flexibility(stroke coupling to nd4j) and efficiency(complexity on nd4j implemention): https://github.com/eclipse/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering > >> If we want the cluster result useful enough for parse "Predict", >> the result of "KMeansPlusPlusClusterer.cluster" should return a >>"ClusterSet": >> ```java >> public interface ClusterSet<T extends Clusterable> extends Collection<T> { >> // Retrun the cluster which the point should belong to. >> Cluster predict(T point); >> // Add a point to best cluster. >> void addPoint(T point); >> } >> ``` > >This "ClusterSet" seems less flexible than a "List<Cluster>". "ClusterSet" is useful for predict as I mentioned above. > >> And "centroidOf"(just used in clustering iteration) can move up into a >> abstract class like "CenroidClusterer". > >It seems that this method could be useful for users too. > May be "centroidOf" is useful, but now it is just used in kmeans. >Best, >Gilles > >> >It would seem that "center" should be a property computed >> >from the contents of "Cluster" e.g.: >> > >> >@FunctionalInterface >> >public interface ClusterCenterComputer<T extends Clusterable> { >> > T centroidOf(Cluster<T> cluster); >> >} >> > >> >Regards, >> >Gilles >> > > >--------------------------------------------------------------------- >To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >For additional commands, e-mail: dev-h...@commons.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org