On Fri, 6 Mar 2020 at 14:35, chentao...@qq.com <chentao...@qq.com> wrote:
>
> Hi,
>
> >Hello.
> >
> >2020-03-06 9:48 UTC+01:00, chentao...@qq.com <chentao...@qq.com>:
> >> Hi,
> >> For centroid-based clustering algorithms in machine learning, we often
> >> use the Calinski-Harabasz score to evaluate which algorithm, or how many
> >> centers, is best for a dataset.
> >> The Python library sklearn implements Calinski-Harabasz as
> >> sklearn.metrics.calinski_harabasz_score.
> >
> >Could you post a reference (most of our documentation points
> >to "Wikipedia" or "MathWorld")?
>
> "Calinski-Harabasz" is the most popular evaluator for centroid clusters,
> as far as I know.
> I just read the sklearn code, and I think it is easy to implement.
> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html
> https://www.tandfonline.com/doi/abs/10.1080/03610927408827101

Thanks; the original reference is quite fine too.

> >
> >> I think there should be a CalinskiHarabaszClusterEvaluator in commons math:
> >
> >At first sight, the approach would be to define a functional
> >interface (with the "score" method).
> >Then an "enum" that would be a factory of evaluators, along
> >the lines of what has been done in "Commons RNG" (see class
> >"RandomSource"[1]).
>
> I just inherited the design of "ClusterEvaluator",
> and I think changing the design of the existing API is another question.

Not really: IMHO we should not pile features on top of an API
that might have shortcomings.
In particular, the fact that the new class's constructor calls the
parent's constructor with "null" looks problematic to me.

> >
> >> ```java
> >> package org.apache.commons.math4.ml.clustering.evaluation;
> >>
> >> import org.apache.commons.math4.ml.clustering.Cluster;
> >> import org.apache.commons.math4.ml.clustering.Clusterable;
> >>
> >> import java.util.List;
> >>
> >> public class CalinskiHarabaszClusterEvaluator<T extends Clusterable>
> >>         extends ClusterEvaluator<T> {
> >>     @Override
> >>     public double score(List<? extends Cluster<T>> clusters) {
> >>         //TODO: Implement the Calinski-Harabasz Score algorithm
> >>         return 0;
> >>     }
> >>
> >>     @Override
> >>     public boolean isBetterScore(double score1, double score2) {
> >>         return score1 > score2;
> >>     }
> >
> >This method does not seem very useful.

I've now seen how this is used by "MultiKMeansPlusPlusClusterer".
However, I wonder why the "Multi" feature is only available for
that implementation...

> >> }
> >> ```
> >>
> >> The code can be implemented by reading the algorithm's documentation,
> >> or translated from Python's sklearn.metrics.calinski_harabasz_score.
> >
> >What's the license of that code?
>
> sklearn is under the BSD license.

OK; no problem[1] to have claimed inspiration then. ;-)

Please note that, for tracking purposes, your PR should be tied to
a JIRA report, and the issue's identifier should prefix the commit
message.
The PR is also not in sync with the current "master" branch.

Regards,
Gilles

[1] http://www.apache.org/legal/resolved.html#category-a

> I think the math "ml" package references sklearn a lot,
> for example: org.apache.commons.math4.userguide.ClusterAlgorithmComparison
>
> >
> >Regards,
> >Gilles
> >
> >[1]
> >https://commons.apache.org/proper/commons-rng/commons-rng-simple/javadocs/api-1.3/org/apache/commons/rng/simple/RandomSource.html
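
For illustration only, here is a minimal sketch of the approach suggested in
this thread (a functional interface with a single "score" method, plus an
"enum" acting as a factory), applied to the Calinski-Harabasz score. The
names "ClusterScorer", "ClusterScorers" and "CalinskiHarabasz" are
hypothetical and not part of Commons Math. Clusters are assumed to be plain
lists of double[] points rather than "Cluster"/"Clusterable" instances. The
formula is the usual definition, (B / (k - 1)) / (W / (n - k)), where B is
the between-cluster and W the within-cluster sum of squared distances; it is
not a translation of the sklearn code.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper computing the Calinski-Harabasz score. */
final class CalinskiHarabasz {
    private CalinskiHarabasz() {}

    /** Each cluster is a non-empty list of points of equal dimension. */
    static double score(List<List<double[]>> clusters) {
        final int k = clusters.size();
        int n = 0;
        for (List<double[]> c : clusters) {
            if (c.isEmpty()) {
                throw new IllegalArgumentException("Empty cluster");
            }
            n += c.size();
        }
        if (k < 2 || n <= k) {
            throw new IllegalArgumentException("Need 2 <= k < n");
        }
        final int dim = clusters.get(0).get(0).length;

        // Centroid of each cluster and of the whole data set.
        final double[] global = new double[dim];
        final double[][] centroids = new double[k][dim];
        for (int i = 0; i < k; i++) {
            final List<double[]> c = clusters.get(i);
            for (double[] p : c) {
                for (int d = 0; d < dim; d++) {
                    centroids[i][d] += p[d];
                    global[d] += p[d];
                }
            }
            for (int d = 0; d < dim; d++) {
                centroids[i][d] /= c.size();
            }
        }
        for (int d = 0; d < dim; d++) {
            global[d] /= n;
        }

        // B: between-cluster dispersion; W: within-cluster dispersion.
        double between = 0;
        double within = 0;
        for (int i = 0; i < k; i++) {
            final List<double[]> c = clusters.get(i);
            between += c.size() * squaredDistance(centroids[i], global);
            for (double[] p : c) {
                within += squaredDistance(p, centroids[i]);
            }
        }
        // Degenerate input (W == 0) is not handled specially in this sketch.
        return (between / (k - 1)) / (within / (n - k));
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            final double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        final List<List<double[]>> clusters = Arrays.asList(
            Arrays.asList(new double[] {0, 0}, new double[] {0, 2}),
            Arrays.asList(new double[] {4, 0}, new double[] {4, 2}));
        // B = 16, W = 4, k = 2, n = 4; prints 8.0
        System.out.println(ClusterScorers.CALINSKI_HARABASZ.create().score(clusters));
    }
}

/** Hypothetical functional interface with the single "score" method. */
@FunctionalInterface
interface ClusterScorer {
    double score(List<List<double[]>> clusters);
}

/** Hypothetical factory of scorers, in the spirit of "RandomSource". */
enum ClusterScorers {
    CALINSKI_HARABASZ {
        @Override
        ClusterScorer create() {
            return CalinskiHarabasz::score;
        }
    };

    /** Creates a scorer for this constant. */
    abstract ClusterScorer create();
}
```

A larger score indicates better-separated clusters, which is consistent with
the "score1 > score2" comparison in the proposed "isBetterScore" method.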