Re: [math]Discuss: There should be a CalinskiHarabaszClusterEvaluator in ml package

Gilles Sadowski Sat, 07 Mar 2020 04:01:07 -0800

Hello.

>>> [...]
>>> >>     For centroid-based clustering algorithms in machine learning, we
>>> >> often use the Calinski-Harabasz score to evaluate which algorithm,
>>> >> or how many centers, is best for a dataset.
>>> >>     The Python library sklearn implements Calinski-Harabasz as
>>> >> sklearn.metrics.calinski_harabasz_score.
>>> >
>>> >Could you post a reference (most of our documentation points
>>> >to "Wikipedia" or "MathWorld")?
>>>
>>> "Calinski-Harabasz" is the most popular evaluator for centroid clusters,
>>> as far as I know.
>>> I just read the sklearn code, and I think it is easy to implement.
>>> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html
>>> https://www.tandfonline.com/doi/abs/10.1080/03610927408827101
>>
>>Thanks; the original reference is quite fine too.
>>
>>> >
>>> >> I think there should be a CalinskiHarabaszClusterEvaluator in commons
>>> >> math:
>>> >
>>> >At first sight, the approach would be to define a functional
>>> >interface (with the "score" method).
>>> >Then an "enum" that would be a factory of evaluators, along
>>> >the lines of what has been done in "Commons RNG" (see class
>>> >"RandomSource"[1]).
>>>
>>> I just inherited the design of "ClusterEvaluator",
>>> and I think changing the design of the existing API is a separate question.
>>
>>Not really: IMHO we should not pile features on top of an
>>API that might have shortcomings.  In particular, the fact
>>that the new class's constructor calls the parent's constructor
>>with "null" looks problematic to me.
>
> IMHO "ClusterEvaluator" should be an interface (not just a functional
> interface), as I describe below.


How about renaming it to something more explicit and
enforcing unambiguous semantics for the ordering?  E.g.

@FunctionalInterface
public interface ClusterRanking<T extends Clusterable> {
    /**
     * Computes the rank (higher is better).
     *
     * @param clusters Clusters to be evaluated.
     * @return the rank of the provided {@code clusters}.
     */
    double compute(List<? extends Cluster<T>> clusters);
}
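
To illustrate, here is a minimal, self-contained sketch of how a
Calinski-Harabasz score could implement such an interface.  It is only
a sketch under assumptions: the "Clusterable" and "Cluster" types below
are simplified stand-ins (not the actual Commons Math classes), the
class name "CalinskiHarabaszRanking" is hypothetical, and degenerate
input (e.g. zero within-cluster dispersion) is not handled specially.
The value is [B / (k - 1)] / [W / (n - k)], where B and W are the
between-cluster and within-cluster sums of squared distances, k is the
number of clusters and n the number of points.

```java
import java.util.List;

// Stand-ins (assumptions) that only mirror the shape of the Commons Math types.
interface Clusterable {
    double[] getPoint();
}

class Cluster<T extends Clusterable> {
    private final List<T> points;
    Cluster(List<T> points) { this.points = points; }
    List<T> getPoints() { return points; }
}

@FunctionalInterface
interface ClusterRanking<T extends Clusterable> {
    /** Computes the rank (higher is better). */
    double compute(List<? extends Cluster<T>> clusters);
}

/** Hypothetical name: Calinski-Harabasz index, for which higher is better. */
class CalinskiHarabaszRanking<T extends Clusterable> implements ClusterRanking<T> {
    @Override
    public double compute(List<? extends Cluster<T>> clusters) {
        final int k = clusters.size();
        int n = 0;
        double[] globalMean = null;
        // First pass: accumulate the sum of all points and count them.
        for (Cluster<T> c : clusters) {
            for (T p : c.getPoints()) {
                final double[] x = p.getPoint();
                if (globalMean == null) {
                    globalMean = new double[x.length];
                }
                for (int i = 0; i < x.length; i++) {
                    globalMean[i] += x[i];
                }
                n++;
            }
        }
        if (k < 2 || n <= k) {
            throw new IllegalArgumentException("Need 2 <= k < n");
        }
        for (int i = 0; i < globalMean.length; i++) {
            globalMean[i] /= n;
        }

        double between = 0; // B: between-cluster dispersion.
        double within = 0;  // W: within-cluster dispersion.
        for (Cluster<T> c : clusters) {
            final List<T> pts = c.getPoints();
            final double[] centroid = new double[globalMean.length];
            for (T p : pts) {
                final double[] x = p.getPoint();
                for (int i = 0; i < x.length; i++) {
                    centroid[i] += x[i] / pts.size();
                }
            }
            between += pts.size() * squaredDistance(centroid, globalMean);
            for (T p : pts) {
                within += squaredDistance(p.getPoint(), centroid);
            }
        }
        return (between / (k - 1)) / (within / (n - k));
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            final double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

Since the reference definition already follows "higher is better", no
conversion is needed in this case.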

>>
>>> >
>>> >> ```java
>>> >> package org.apache.commons.math4.ml.clustering.evaluation;
>>> >>
>>> >> import org.apache.commons.math4.ml.clustering.Cluster;
>>> >> import org.apache.commons.math4.ml.clustering.Clusterable;
>>> >>
>>> >> import java.util.List;
>>> >>
>>> >> public class CalinskiHarabaszClusterEvaluator<T extends Clusterable>
>>> >> extends
>>> >> ClusterEvaluator<T> {
>>> >>     @Override
>>> >>     public double score(List<? extends Cluster<T>> clusters) {
>>> >>         //TODO: Implement the Calinski-Harabasz Score algorithm
>>> >>         return 0;
>>> >>     }
>>> >>
>>> >>     @Override
>>> >>     public boolean isBetterScore(double score1, double score2) {
>>> >>         return score1 > score2;
>>> >>     }
>>> >
>>> >This method does not seem very useful.
>>
>>I've now seen how this is used by "MultiKMeansPlusPlusClusterer".
>>However, I wonder why the "Multi" feature is only available for that
>>implementation...
>>
>
> SumOfClusterVariances is just one of the ClusterEvaluator algorithms.
> IMHO it is necessary to tell the user which score is better for each
> ClusterEvaluator,

Not if we enforce semantics (cf. above).

> but "smaller is better" cannot be the default implementation
> of ClusterEvaluator.isBetterScore,
> and the name "isBetterScore" may be ambiguous (is the second score better
> than the first?)

The convention is set by the documentation.  However,
the current API could be made simpler with the above
proposal.

>
> The ClusterEvaluator can be an interface:
> Solution 1: Compatible with the old API:
> ```java
> public interface ClusterEvaluator<T extends Clusterable>{
>     double score(Collection<? extends Cluster<T>> clusters);
>     // Keep the old API for compatibility
>     boolean isBetterScore(double score1, double score2);
> }
> ```
> Solution 2: Use an explicit function name
> ```java
> public interface ClusterEvaluator<T extends Clusterable>{
>     double score(Collection<? extends Cluster<T>> clusters);
>     // Use an explicit name
>     boolean isScoreImproved(double originScore, double newScore);
> }
> ```

Solution 3  is "ClusterRanking".
In cases where the reference algorithm would assume the
other convention (i.e. "lower is better"), the implementation
is required to apply a conversion (e.g. return the opposite).
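
As a rough sketch of that conversion (assumptions: the "Clusterable"
and "Cluster" types are simplified stand-ins for the Commons Math ones,
and "NegatedWithinClusterScatter" is a hypothetical class, not the
existing "SumOfClusterVariances"), a quantity for which "lower is
better" simply returns its opposite:

```java
import java.util.List;

// Stand-ins (assumptions) that only mirror the shape of the Commons Math types.
interface Clusterable {
    double[] getPoint();
}

class Cluster<T extends Clusterable> {
    private final List<T> points;
    Cluster(List<T> points) { this.points = points; }
    List<T> getPoints() { return points; }
}

@FunctionalInterface
interface ClusterRanking<T extends Clusterable> {
    /** Computes the rank (higher is better). */
    double compute(List<? extends Cluster<T>> clusters);
}

/**
 * Hypothetical ranking based on the within-cluster sum of squared
 * distances to the centroid.  A smaller scatter means tighter clusters,
 * so the opposite is returned in order to obey "higher is better".
 */
class NegatedWithinClusterScatter<T extends Clusterable> implements ClusterRanking<T> {
    @Override
    public double compute(List<? extends Cluster<T>> clusters) {
        double scatter = 0;
        for (Cluster<T> c : clusters) {
            final List<T> pts = c.getPoints();
            if (pts.isEmpty()) {
                continue; // An empty cluster contributes nothing.
            }
            // Centroid of the current cluster.
            final double[] centroid = new double[pts.get(0).getPoint().length];
            for (T p : pts) {
                final double[] x = p.getPoint();
                for (int i = 0; i < x.length; i++) {
                    centroid[i] += x[i] / pts.size();
                }
            }
            // Sum of squared distances of the points to that centroid.
            for (T p : pts) {
                final double[] x = p.getPoint();
                for (int i = 0; i < x.length; i++) {
                    final double d = x[i] - centroid[i];
                    scatter += d * d;
                }
            }
        }
        return -scatter; // Conversion: "lower is better" becomes "higher is better".
    }
}
```

Calling code can then always keep the candidate with the largest
value, without needing any "isBetterScore" method.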

>>> >> }
>>> >> ```
>>> >>
>>> >> The code can be implemented by reading the algorithm's documentation,
>>> >> or by translating Python's sklearn.metrics.calinski_harabasz_score.
>>> >
>>> >What's the license of that code?
>>>
>>> sklearn is under the BSD license.
>>
>>OK; no problem[1] to have claimed inspiration then. ;-)
>>
>>Please note that, for tracking purposes, your PR should be tied
>>to a JIRA report, and the issue's identifier should prefix the
>>commit message.
>
> This PR is for discussion; I will create a JIRA issue.

Proposals and conceptual discussions are posted here (in this
case, you proposed to contribute a new "ClusterEvaluator".)
If there are no objections, a JIRA report should then be filed, to
which you can attach PRs (this has the advantage of being
automatically tracked).

> But I still do not know how we reach a conclusion.

Implementation details will be discussed in JIRA comments
(not GitHub as far as I'm concerned).
[Issues that have more far-reaching consequences can be
posted back here.]

>
>>The PR is also not in sync with current "master" branch.
>
> Which branch should I pull?

"master".
You could probably "rebase" your branch on it; try
  $ git rebase master


Regards,
Gilles

>>> [...]

