Re: [math]Discuss: There should be a CalinskiHarabaszClusterEvaluator in ml package

[email protected] Fri, 06 Mar 2020 22:05:08 -0800

Hi,

>On Fri, 6 Mar 2020 at 14:35, [email protected] <[email protected]> wrote:
>>
>> Hi,
>>
>> >Hello.
>> >
>> >2020-03-06 9:48 UTC+01:00, [email protected] <[email protected]>:
>> >> Hi,
>> >>     For centroid-based clustering algorithms in machine learning, we often use the
>> >> Calinski-Harabasz score to evaluate which algorithm, or how many centers, is
>> >> best for a dataset.
>> >>     The Python library sklearn implements Calinski-Harabasz as
>> >> sklearn.metrics.calinski_harabasz_score.
>> >
>> >Could you post a reference (most of our documentation points
>> >to "Wikipedia" or "MathWorld")?
>>
>> "Calinski-Harabasz" is the most popular evaluator for centroid clusters, as far as I
>> know.
>> I just read the sklearn code, and I think it is easy to implement.
>> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html
>> https://www.tandfonline.com/doi/abs/10.1080/03610927408827101
>
>Thanks; the original reference is quite fine too.
>
>> >
>> >> I think there should be a CalinskiHarabaszClusterEvaluator in commons 
>> >> math:
>> >
>> >At first sight, the approach would be to define a functional
>> >interface (with the "score" method).
>> >Then an "enum" that would be a factory of evaluators, along
>> >the lines of what has been done in "Commons RNG" (see class
>> >"RandomSource"[1]).
>>
>> I just inherited the design of "ClusterEvaluator",
>> and I think changing the design of the existing API is a separate question.
>
>Not really: IMHO we should not pile features on top of an
>API that might have shortcomings.  In particular, the fact
>that the new class's constructor calls the parent's constructor
>with "null" looks problematic to me.


IMHO "ClusterEvaluator" should be an interface (not just a functional interface),
as I describe below.

>
>> >
>> >> ```java
>> >> package org.apache.commons.math4.ml.clustering.evaluation;
>> >>
>> >> import org.apache.commons.math4.ml.clustering.Cluster;
>> >> import org.apache.commons.math4.ml.clustering.Clusterable;
>> >>
>> >> import java.util.List;
>> >>
>> >> public class CalinskiHarabaszClusterEvaluator<T extends Clusterable>
>> >>     extends ClusterEvaluator<T> {
>> >>     @Override
>> >>     public double score(List<? extends Cluster<T>> clusters) {
>> >>         //TODO: Implement the Calinski-Harabasz Score algorithm
>> >>         return 0;
>> >>     }
>> >>
>> >>     @Override
>> >>     public boolean isBetterScore(double score1, double score2) {
>> >>         return score1 > score2;
>> >>     }
>> >
>> >This method does not seem very useful.
>
>I've now seen how this is used by "MultiKMeansPlusPlusClusterer".
>However, I wonder why the "Multi" feature is only available for that
>implementation...
> 

SumOfClusterVariances is just one of the ClusterEvaluator algorithms.
IMHO it is necessary to tell the user which score is better for each
ClusterEvaluator,
but "smaller is better" cannot be the default implementation of
ClusterEvaluator.isBetterScore,
and the name "isBetterScore" may be ambiguous (is the second score better than the
first?).

ClusterEvaluator could be an interface:
Solution 1: Compatible with the old API:
```java
public interface ClusterEvaluator<T extends Clusterable>{
    double score(Collection<? extends Cluster<T>> clusters);
    // Keep the old API for compatibility
    boolean isBetterScore(double score1, double score2);
}
```
Solution 2: Use an explicit function name
```java
public interface ClusterEvaluator<T extends Clusterable>{
    double score(Collection<? extends Cluster<T>> clusters);
    // Use an explicit name
    boolean isScoreImproved(double originScore, double newScore);
}
```
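
For illustration, here is a minimal, self-contained sketch of how the score itself
could be computed behind an interface shaped like Solution 2. It is not the
Commons Math API: "ScoreEvaluator" and "CalinskiHarabaszSketch" are hypothetical
names, clusters are plain lists of double[] points rather than Cluster<T>, and
returning 1.0 when the within-cluster dispersion is zero is a convention assumed
for this sketch. The score is [B / (k - 1)] / [W / (n - k)], where B is the
between-cluster dispersion, W the within-cluster dispersion, k the number of
clusters and n the number of points.
```java
import java.util.List;

/** Hypothetical stand-in for the interface of Solution 2 (not the Commons Math API). */
interface ScoreEvaluator {
    double score(List<List<double[]>> clusters);
    boolean isScoreImproved(double originScore, double newScore);
}

/** Sketch of the Calinski-Harabasz score on plain double[] points. */
final class CalinskiHarabaszSketch implements ScoreEvaluator {
    @Override
    public double score(List<List<double[]>> clusters) {
        final int k = clusters.size();
        if (k < 2) {
            throw new IllegalArgumentException("At least two clusters are required");
        }
        for (List<double[]> cluster : clusters) {
            if (cluster.isEmpty()) {
                throw new IllegalArgumentException("Empty clusters are not supported");
            }
        }
        final int dim = clusters.get(0).get(0).length;
        final double[][] centroids = new double[k][dim];
        final double[] global = new double[dim];
        int n = 0;
        // Accumulate per-cluster sums and the global sum, then turn them into means.
        for (int i = 0; i < k; i++) {
            final List<double[]> cluster = clusters.get(i);
            for (double[] p : cluster) {
                for (int d = 0; d < dim; d++) {
                    centroids[i][d] += p[d];
                    global[d] += p[d];
                }
            }
            for (int d = 0; d < dim; d++) {
                centroids[i][d] /= cluster.size();
            }
            n += cluster.size();
        }
        for (int d = 0; d < dim; d++) {
            global[d] /= n;
        }
        double between = 0; // B: size-weighted squared distance of centroids to the global mean
        double within = 0;  // W: squared distance of points to their own centroid
        for (int i = 0; i < k; i++) {
            final List<double[]> cluster = clusters.get(i);
            between += cluster.size() * squaredDistance(centroids[i], global);
            for (double[] p : cluster) {
                within += squaredDistance(p, centroids[i]);
            }
        }
        if (within == 0) {
            return 1.0; // Assumed convention for this sketch (avoids dividing by zero).
        }
        return between * (n - k) / (within * (k - 1));
    }

    /** For Calinski-Harabasz, a larger score means better-separated clusters. */
    @Override
    public boolean isScoreImproved(double originScore, double newScore) {
        return newScore > originScore;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            final double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```
With this shape, a caller comparing two clusterings asks
isScoreImproved(oldScore, newScore) and does not need to know that larger is
better for this particular evaluator.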

>> >> }
>> >> ```
>> >>
>> >> The code can be implemented by reading the algorithm's documentation,
>> >> or translated from Python's sklearn.metrics.calinski_harabasz_score.
>> >
>> >What's the license of that code?
>>
>> sklearn is under the BSD license.
>
>OK; no problem[1] to have claimed inspiration then. ;-)
>
>Please note that, for tracking purposes, your PR should be tied
>to a JIRA report, and the issue's identifier should prefix the
>commit message.

This PR is for discussion; I will create a JIRA issue.
But I still do not know how we reach a conclusion.

>The PR is also not in sync with current "master" branch. 

Which branch should I pull?

>
>Regards,
>Gilles
>
>[1] http://www.apache.org/legal/resolved.html#category-a
>
>> I think the math ml package references sklearn a lot,
>> for example: org.apache.commons.math4.userguide.ClusterAlgorithmComparison
>>
>> >
>> >Regards,
>> >Gilles
>> >
>> >[1] 
>> >https://commons.apache.org/proper/commons-rng/commons-rng-simple/javadocs/api-1.3/org/apache/commons/rng/simple/RandomSource.html
>
