Hi Gilles: I really appricate for your patient to help me to solve the mail sending problem, I try to set the only setting about charset "Use Unicode" for this mail. I have created a pull request: https://github.com/apache/commons-math/pull/118 And a comparsion between KMeans and MiniBatchKMeans by using the "org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
------------------ Original ------------------ From: "Gilles Sadowski";<gillese...@gmail.com>; Date: Feb 19, 2020 To: "dev@commons.apache.org"<dev@commons.apache.org>; Subject: Re: [math] Discuss: New feature MiniBatchKMeansClusterer [Re-sending post so that it is attached to the original thread.] Hello. Le mar. 18 févr. 2020 à 04:49, 陈 涛 <chentao...@hotmail.com> a écrit : > > Hi Gilles: > > I really do not know if anyone received my last mail, no one replay me for > a long time so I send it again and copy to you with another email system. Sorry for the delay. :-} > > > Some remarks: > > > * I didn't get why the "KMeansPlusPlusCentroidInitializer" class > > does not call the existing "KMeansPlusPlusClusterer". > > Code seems duplicated: As there is a case for reuse, the currently > > "private" centroid initialization code should be factored out. > > This is alpha version for discuss the "MiniBatchKMeansClusterer" algorithm, I guess that you mean that we discuss your implementation of the algorithm referenced in the Javadoc. > and when "centroidOf" is extracted from "KMeansPlusPlusClusterer", > > the "KMeansPlusPlusClusterer" is not "KMeansPlusPlusClusterer" anymore but > "KMeansClusterer", I don't follow. > > this is a significant change, so I did not reactor it. Significant changes are welcome (since the next release will contain other major changes anyways) if they improve the code base (like e.g. reducing code duplication). > > > > > * In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize > > that a computation is performed (as opposed to selecting from an > > existing data structure). I think I'd prefer "selectCentroids". > > It is extract from "KMeansPlusPlusClusterer.centroidOf", should remain be > "centroidOf"? I don't understand. It would be easier if you create a pull request, so that we can clearly see what codes are added/removed/changed. > > The subclass "RandomCentroidInitializer" and > "KMeansPlusPlusCentroidInitializer" indicate the algorithm used. > > > > * Not convinced that there should be so many constructors (in most > > cases, it makes no sense to pick default values that are likely to > > be heavily application-dependent. > > I can add more constructors. No, the constructors with default values clutter the API, for no obvious gain IMHO. [If the default values make sense, they must be documented.] > > I'd like a builder class more than constructors, but does not meet the > historical code style. Now is the time for improving the API. It would be quite helpful to create a report on JIRA with "sub-tasks" for all such API proposed changes. > > * Input should generally be validated: e.g. the maximum number of > > iterations should not be changed unwittingly; rather, an exception > > should be raised if the user passed a negative value. > > Thanks for your advices, I will improve these. > > > Could be nice to illustrate (not just with a picture, but in a table > > with entries average over several runs) the differences in result > > between the implementations, using various setups (number of > > clusters, stopping criterion, etc.). > > I will make more tests, include benchmarks. > > It is a challenge for me to generate the various kinds of test data, > > could anybody supply me the test data of this comparsion: > http://commons.apache.org/proper/commons-math/userguide/ml.html They are generated programmatically; code is here: https://gitbox.apache.org/repos/asf?p=commons-math.git;a=tree;f=src/userguide/java/org/apache/commons/math4/userguide [I've just updated the codes so that they compiles and run using the new dependencies (see the "README" file).] > > "MT_64" is probably not the best default. And this is one of the > > parameters from which there should not be a default IMO. > > I will do more tests You don't need to test the generators; users should choose by themselves (from those provided in "Commons RNG"). > > > [Note: there are spurious characters in your message (see e.g. the > > paragraph quoted just above) that make it difficult to read.] > > I had well format my mail in my mail box, it may been changed by the Mail > List service. > > I will try various kinds of mail editor. It will helpful if you told me which > mail editor is work well with the ML. It's probably an encoding thing (setting it to "UTF-8" should be fine). Best regards, Gilles > > [...] --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org