Hello. > > [...] > >>>> > >>>> I have started 2 PRs to solve the problem you metioned. > >>>> > >>>> About the "CentroidInitializer" I have a new idea: > >>>> Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster", > >>>> and add a construct parameter and a property "useKMeansPlusPlus" to > >>>> "KMeansPlusPlusCluster": > >>>> ```java > >>>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer" > >>>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends > >>>> Clusterer<T> { > >>>> public KMeansPlusPlusClusterer(final int k, final int maxIterations, > >>>> final DistanceMeasure measure, > >>>> final UniformRandomProvider random, > >>>> final EmptyClusterStrategy emptyStrategy, > >>>> + final useKMeansPlusPlus) { > >>>> // ... > >>>> - // Use K-means++ to choose the initial centers. > >>>> - this.centroidInitializer = new > >>>> KMeansPlusPlusCentroidInitializer(measure, random); > >>>> + this.useKMeansPlusPlus = useKMeansPlusPlus; > >>>> } > >>> > >>>What if one comes up with a third way to initialize the centroids? > >>>If you can ensure that there is no other initialization procedure, > >>>a boolean is fine, if not, we could still make the existing procedures > >>>package-private (e.g. moving them in as classes defined within > >>>"KMeansPlusPlusClusterer". > >> > >> As I know the k-means has two center initialize methods, random and > >> k-means++ so far, > >> use a boolean to choose which method to use is good enough for current use, > > > >Indeed but the question is: Will it be future-proof? > >[We don't want to break compatibility (and have to release a > >new major version of the library) for having overlooked the > >question which I'm asking.] > > > >> but there are two situations use need to implement the center initialize > >> method themselves: > >> 1. The Commoans Maths's implements is not good enough; > > > >As this is FLOSS, the understanding (IMO) is in that case users > >would contribute back to fix what needs be. > > > >> 2. There are new center initialize methods. > > > >So, that would be arguing against using a boolean (?). > >Cf. above (about breaking compatibility). > The name "KMeansPlusPlusClusterer" decide it won't support new initialize > methods. > Whether "KMeansPlusPlusClusterer" compatible random
Sorry, but I don't understand what you mean. > >>> > >>>Also, in the current implementation of "KMeansPlusPlusClusterer", the > >>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer"). > >>>Perhaps we don't want to depart from the original (?) algorithm; if so, > >>>the new constructor could be made protected (thus simplifying the API). > >> > >> k-means++ is the recommend center initialize method for now days, > >> show we let user to fall back to random choose centers, that is a question > >> need to tradeoff. > > > >Is there any advantage to "random" init vs "kmeans" init? > >E.g. is "random" faster, yet would give similar clustering results? > > k-means++ is an improvement of k-means(random init). Do I understand correctly that to call the algorithm KMeans++, the only change (from "Kmeans") is the centroid initialization (using "KMeans" vs using "random")? If so, it would be a contradiction (name vs functionality) to allow "random" in a class called "KMeansPlusPlusClusterer". So it seems the alternatives are: (1) Drop "random" init (and keep "KMeansPlusPlusClusterer") if you are positive that this functionality won"t be missed. (2) Allow flexibility for init (and rename "KMeansPlusPlusClusterer" to "KMeansClusterer"), adding a boolean argument to the constructor: "doKMeansPlusPlus" to select "random" (false) or "kmeans" (true). (3) Allow flexibility by passing a function to perform the init. According to what you wrote, (1) is the better alternative right now. Option (3) might (perhaps!) become a nice enhancement, but only after we settle on a better design for the fundamental classes ("Cluster", "ClusterSet", "ClusterResult", ...). > The performance loss by using "k-means++" is tinily relative to entire > clustering process. > I think keep old state is OK(make chooseInitialCenters package-private), > but now "CentroidInitializer" is a middle state. +1 if you mean that you'll remove, for now, the package "initialization". Please submit a PR. > > > >> Show we make the API simple or rich? > > > >I'd keep it simple until we fix the (IMHO) more important > >issues of thread-safety and sparse data. > > Some Personal opinion about "Sparse data": > Math ML do not use a martix for clustering, but list of vectors(double > arrays), > this is flexible and simple (This is the main attraction of Math ML to me). > > Operations on sparse martixs can be faster than common martixs, > but not that faster on sparse vectors, but the performance improvement will > bring complexity. > > Anyhow if we want to start this optimization, I think there are 2 ways: > 1. Define a abstract Martix, and make Cluser API work on Martix, this is a > big change. > 2. Defina a abstract Vector, and replace double[] with Vector, implement a > series of operations on Vector, include Sparse Vector. Unless I'm missing some point, this is going astray; my mentioning of "sparse data" is related to issue MATH-1330.[1] IIUC the concern there was that it should be possible to process the data without forcing the instantiation of a "double[]". Best, Gilles [1] https://issues.apache.org/jira/browse/MATH-1330 > > [...] --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org