Hi. Le ven. 28 févr. 2020 à 05:04, chentao...@qq.com <chentao...@qq.com> a écrit : > > > > > -------------- > chentao...@qq.com > >Hi. > > > >Le jeu. 27 févr. 2020 à 06:17, chentao...@qq.com <chentao...@qq.com> a écrit > >: > >> > >> Hi, > >> > >> > [...] > >> >> >> > >> >> >> Do you mean I should fire a JIRA issue about reuse "centroidOf" > >> >> >> and "chooseInitialCenters", > >> >> >> then start a PR and a disscuss about "ClusterUtils"? > >> >> >> And then start the PR of "MiniBatchKMeansClusterer" after all > >> >> >> done? > >> >> > > >> >> >I cannot guarantee that the whole process will be streamlined. > >> >> >In effect, you can work on multiple branches (one for each > >> >> >prospective PR). > >> >> >I'd say that you should start by describing (here on the ML) the > >> >> >rationale for "ClusterUtils" (and contrast it with say, a common > >> >> >base class). > >> >> >[Only when the design has been agreed on, a JIRA issue to > >> >> >implement it should be created in order to track the actual > >> >> >coding work).] > >> >> > >> >> OK, I think we should start from here: > >> >> > >> >> The method "centroidOf" and "chooseInitialCenters" in > >> >> KMeansPlusPlusClusterer > >> >> could be reused by other KMeans Clusterer like > >> >> MiniBatchKMeansClusterer which I want to implement. > >> >> > >> >> There are two solution for reuse "centroidOf" and > >> >> "chooseInitialCenters": > >> >> 1. Extract a abstract class for KMeans Clusterer named > >> >> "AbstractKMeansClusterer", > >> >> and move "centroidOf" and "chooseInitialCenters" as protected methods > >> >> in it; > >> >> the EmptyClusterStrategy and related logic can also move to the > >> >> "AbstractKMeansClusterer". > >> >> 2. Create a static utility class, and move "centroidOf" and > >> >> "chooseInitialCenters" in it, > >> >> and some useful clustering method like predict(Predict which cluster > >> >> is best for a specified point) can put in it. > >> >> > >> > > >> >At first sight, I prefer option 1. > >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no > >> >interest to > >> >users of the functionality (and so should not be part of the "public" > >> >API). > >> > >> Persuasive explain, and I agree with you, that extract a abstract class > >> for KMeans is better. > >> And how can we make a conclusion? > >> --------------------------------------------- > >> > >> Mention the "public API", I suppose there should be a series of > >> "CentroidInitializer", > >> that "chooseInitialCenters" with various of algorithms. > >> The k-means++ cluster algorithm is a special implementation of k-means > >> which initialize cluster centers with k-means++ algorithm. > >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be > >> "KMeansClusterer" > >> with a "KMeansPlusPlusCentroidInitializer" strategy. > >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it > >> is a common k-means. > >> > >> ---------------------------------------------------------- > >> >Method "centroidOf" looks generally useful. Shouldn't it be part of > >> >the "Cluster" > >> >interface? What is the difference with method "getCenter" (define by > >> >class > >> >"CentroidCluster")? > >> > >> My understanding is,: > >> * "Cluster" is a data class that carry the result of a clustering, > >> "getCenter" is just a get method of CentroidCluster for get the value of a > >> center point. > >> * "Cluster[er]" is a (Interface of )algorithm that classify points to > >> sets of Cluster. > >> * "CentroidCluster" is the result of a group of special Clusterer > >> algorithm like k-means, > >> "centroidOf" is a specific logic to calculate the center point for a > >> collection of points. > >> [Instead the DBScan cluster algorithm dose not care about the "Centroid"] > >> > >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists yet), > >> but different with "CentroidCluster.getCenter". > > > >I may be missing something about the existing design, > >but it seems strange that "CentroidCluster" is initialized > >with a given "center", yet it is possible to add points after > >initialization (which IIUC would invalidate the "center"). > > The "centroidOf" could be part of "CentroidCluster", > but I think the existsing desgin was focus on decouple of > "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster".
I don't see why we need both "Cluster" and "CentroidCluster". Indeed, as suggested before, the "center" can be computed from a "Cluster", but does not need to be stored in it. > > Center recalculate often happens in each iteration of k-means Clustering, > always with points reassign to clusters. > We often use k-means as two pharse: > Pharse 1: Training, classify thousands of points to set of clusters. > Pharse 2: Predict, predict which cluster is best for a new point, > or add a new point to the best cluster in ClusterSet, Method "cluster" returns a "List<Cluster>"; there is no need for a new "ClusterSet" class. Also, IIUC the centers can be collected into a "List<Clusterable>", so that the association is through the index into the list(s). > but we never update the cluster center until next retraining. IMO, that's the reason for *not*" storing the center (in such a mutable instance). > > The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math" > just design for pharse "Training", > it is clearly if we can consider "CentroidCluster" as a pure data class just > for k-means clustering result. See above. Discussing the existing design further, I think that the "cluster" method should rather be: ---CUT--- public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure dist) ---CUT--- And, similarly, ---CUT--- @FunctionalInterface public interface ClusterFinder<T extends Clusterable> { public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure dist); } ---CUT--- > If we want the cluster result useful enough for parse "Predict", > the result of "KMeansPlusPlusClusterer.cluster" should return a > "ClusterSet": > ```java > public interface ClusterSet<T extends Clusterable> extends Collection<T> { > // Retrun the cluster which the point should belong to. > Cluster predict(T point); > // Add a point to best cluster. > void addPoint(T point); > } > ``` This "ClusterSet" seems less flexible than a "List<Cluster>". > And "centroidOf"(just used in clustering iteration) can move up into a > abstract class like "CenroidClusterer". It seems that this method could be useful for users too. Best, Gilles > >It would seem that "center" should be a property computed > >from the contents of "Cluster" e.g.: > > > >@FunctionalInterface > >public interface ClusterCenterComputer<T extends Clusterable> { > > T centroidOf(Cluster<T> cluster); > >} > > > >Regards, > >Gilles > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org