Hello.

>>> [...]
>>> >> For machine learning centroid clustering algorithms, we often use
>>> >> the Calinski-Harabasz score to evaluate which algorithm or how many
>>> >> centers is
>>> >> best for a dataset.
>>> >> The python lib sklearn implements Calinski-Harabasz as
>>> >> sklearn.metrics.calinski_harabasz_score.
>>> >
>>> >Could you post a reference (most of our documentation points
>>> >to "Wikipedia" or "MathWorld")?
>>>
>>> "Calinski-Harabasz" is the most popular evaluator for centroid clusters
>>> as far as I know.
>>> I just read the code of sklearn, and think it is easy to implement.
>>> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html
>>> https://www.tandfonline.com/doi/abs/10.1080/03610927408827101
>>
>>Thanks; the original reference is quite fine too.
>>
>>> >
>>> >> I think there should be a CalinskiHarabaszClusterEvaluator in commons
>>> >> math:
>>> >
>>> >At first sight, the approach would be to define a functional
>>> >interface (with the "score" method).
>>> >Then an "enum" that would be a factory of evaluators, along
>>> >the lines of what has been done in "Commons RNG" (see class
>>> >"RandomSource"[1]).
>>>
>>> I just inherited the design of "ClusterEvaluator",
>>> and I think changing the design of the existing API is another question.
>>
>>Not really: IMHO we should not pile features on top of an
>>API that might have shortcomings. In particular, the fact
>>that the new class's constructor calls the parent's constructor
>>with "null" looks problematic to me.
>
> IMHO "ClusterEvaluator" should be an interface (not just a functional
> interface) as I described below.

How about renaming it to something more explicit and enforcing
unambiguous semantics for the ordering?
E.g.

  @FunctionalInterface
  public interface ClusterRanking<T extends Clusterable> {
      /**
       * Computes the rank (higher is better).
       *
       * @param clusters Clusters to be evaluated.
       * @return the rank of the provided {@code clusters}.
       */
      double compute(List<? extends Cluster<T>> clusters);
  }

>>
>>> >
>>> >> ```java
>>> >> package org.apache.commons.math4.ml.clustering.evaluation;
>>> >>
>>> >> import org.apache.commons.math4.ml.clustering.Cluster;
>>> >> import org.apache.commons.math4.ml.clustering.Clusterable;
>>> >>
>>> >> import java.util.List;
>>> >>
>>> >> public class CalinskiHarabaszClusterEvaluator<T extends Clusterable>
>>> >> extends
>>> >> ClusterEvaluator<T> {
>>> >>     @Override
>>> >>     public double score(List<? extends Cluster<T>> clusters) {
>>> >>         //TODO: Implement the Calinski-Harabasz Score algorithm
>>> >>         return 0;
>>> >>     }
>>> >>
>>> >>     @Override
>>> >>     public boolean isBetterScore(double score1, double score2) {
>>> >>         return score1 > score2;
>>> >>     }
>>> >
>>> >This method does not seem very useful.
>>
>>I've now seen how this is used by "MultiKMeansPlusPlusClusterer".
>>However, I wonder why the "Multi" feature is only available for that
>>implementation...
>>
>
> SumOfClusterVariances is just one of the ClusterEvaluator algorithms.
> IMHO it is necessary to tell the user which score is better for each
> ClusterEvaluator,

Not if we enforce semantics (cf. above).

> but "smaller is better" cannot be the default implementation
> of ClusterEvaluator.isBetterScore,
> and the name "isBetterScore" may be ambiguous (Is the second score better than
> the first?)

The convention is set by the documentation.
However, the current API could be made simpler with the above proposal.

>
> The ClusterEvaluator can be an interface:
> Solution 1: Compatible with the old API:
> ```java
> public interface ClusterEvaluator<T extends Clusterable>{
>     double score(Collection<? extends Cluster<T>> clusters);
>     // Keep old API for compatibility
>     boolean isBetterScore(double score1, double score2);
> }
> ```
> Solution 2: Use an explicit function name
> ```java
> public interface ClusterEvaluator<T extends Clusterable>{
>     double score(Collection<? extends Cluster<T>> clusters);
>     // Use an explicit name
>     boolean isScoreImproved(double originScore, double newScore);
> }
> ```

Solution 3 is "ClusterRanking".
In cases where the reference algorithm would assume the other convention
(i.e. "lower is better"), the implementation is required to apply a
conversion (e.g. return the opposite).

>>> >> }
>>> >> ```
>>> >>
>>> >> The code can be implemented by reading the algorithm documents,
>>> >> or translated from python sklearn.metrics.calinski_harabasz_score.
>>> >
>>> >What's the license of that code?
>>>
>>> The sklearn is under the BSD license.
>>
>>OK; no problem[1] to have claimed inspiration then. ;-)
>>
>>Please note that, for tracking purposes, your PR should be tied
>>to a JIRA report, and the issue's identifier should prefix the
>>commit message.
>
> This PR is for discussion, I will create a JIRA issue.

Proposals and conceptual discussions are posted here (in this
case, you proposed to contribute a new "ClusterEvaluator").
If there are no objections, a JIRA report should then be filed,
to which you can attach PRs (this has the advantage of being
automatically tracked).

> But I still do not know how we make a conclusion.

Implementation details will be discussed in JIRA comments (not
GitHub as far as I'm concerned).
[Issues that have more far-reaching consequences can be posted
back here.]

>
>>The PR is also not in sync with current "master" branch.
>
> Which branch should I pull?

"master".
You could probably "rebase" your branch on it; try
  $ git rebase master

Regards,
Gilles

>>> [...]
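
For illustration only, here is a rough, self-contained sketch of how the "ClusterRanking" idea (Solution 3) might look, with one ranking that is naturally "higher is better" (Calinski-Harabasz) and one that has to return the opposite of a "lower is better" quantity. Everything in it is an assumption made for the example: the class name "ClusterRankingSketch" is hypothetical, a cluster is reduced to a plain list of "double[]" points instead of "Cluster<T>", clusters are assumed non-empty, and this is not the Commons Math API nor a proposed implementation.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: hypothetical names, plain lists instead of Cluster<T>. */
final class ClusterRankingSketch {
    /** Simplified stand-in for the proposed interface: higher is always better. */
    @FunctionalInterface
    interface ClusterRanking {
        double compute(List<List<double[]>> clusters);
    }

    /** Component-wise mean of a non-empty list of points. */
    static double[] centroid(List<double[]> points) {
        final double[] m = new double[points.get(0).length];
        for (double[] p : points) {
            for (int i = 0; i < m.length; i++) {
                m[i] += p[i] / points.size();
            }
        }
        return m;
    }

    /** Squared Euclidean distance. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return s;
    }

    /** Within-cluster sum of squares W ("lower is better"). */
    static double withinSumOfSquares(List<List<double[]>> clusters) {
        double w = 0;
        for (List<double[]> c : clusters) {
            final double[] center = centroid(c);
            for (double[] p : c) {
                w += sqDist(p, center);
            }
        }
        return w;
    }

    /** Calinski-Harabasz index: [B / (k - 1)] / [W / (n - k)]. */
    static double calinskiHarabasz(List<List<double[]>> clusters) {
        final int k = clusters.size();
        int n = 0;
        for (List<double[]> c : clusters) {
            n += c.size();
        }
        if (k < 2 || n <= k) {
            throw new IllegalArgumentException("Need 2 <= k < n");
        }
        // Global mean as the size-weighted mean of the cluster centroids.
        final double[] global = new double[clusters.get(0).get(0).length];
        for (List<double[]> c : clusters) {
            final double[] center = centroid(c);
            for (int i = 0; i < global.length; i++) {
                global[i] += center[i] * c.size() / n;
            }
        }
        // Between-cluster dispersion B.
        double b = 0;
        for (List<double[]> c : clusters) {
            b += c.size() * sqDist(centroid(c), global);
        }
        return (b / (k - 1)) / (withinSumOfSquares(clusters) / (n - k));
    }

    public static void main(String[] args) {
        final ClusterRanking ch = ClusterRankingSketch::calinskiHarabasz;
        // "Lower is better" quantity converted by returning the opposite.
        final ClusterRanking byVariance = c -> -withinSumOfSquares(c);

        final List<List<double[]>> clusters = Arrays.asList(
            Arrays.asList(new double[] {0, 0}, new double[] {0, 2}),
            Arrays.asList(new double[] {10, 0}, new double[] {10, 2}));
        System.out.println(ch.compute(clusters));
        System.out.println(byVariance.compute(clusters));
    }
}
```

With such a convention, a caller comparing two clusterings would only need "a.compute(x) > a.compute(y)", without any "isBetterScore" method.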