Re: [math]Discussion: How to move out "EmptyClusterStrategy" from KMeansPlusPlusClusterer

Gilles Sadowski Fri, 20 Mar 2020 19:28:44 -0700

2020-03-21 2:59 UTC+01:00, [email protected] <[email protected]>:
> Hi,
>
>>Le ven. 20 mars 2020 à 04:47, [email protected] <[email protected]> a écrit
>> :
>>>
>>> Hi,
>>>
>>> >Hello.
>>> >
>>> >Le mer. 18 mars 2020 à 17:57, Gilles Sadowski <[email protected]> a
>>> > écrit :
>>> >>
>>> >> Hi.
>>> >>
>>> >> 2020-03-18 15:10 UTC+01:00, [email protected] <[email protected]>:
>>> >> > Hi,
>>> >> >     I have created a PR to show my aim:
>>> >> > https://github.com/apache/commons-math/pull/126/files
>>> >>
>>> >> Am I correct that the implementations of "ClustersPointExtractor"
>>> >> modify the argument of the "extract" method?
>>> >> If so, that seems quite unsafe.  I would not expect this behaviour
>>> >> in a public API.
>>> >
>>> >To be clearer (hopefully): I'd contrast this (incomplete/non-compilable
>>> >code):
>>> >
>>> >public class Extractor {
>>> >    T extract(List<List<T>> list);
>>> >}
>>>
>>> I have read the exists code again, and I found "EmptyClusterStrategy"
>>>  and related logic only used in K-Means.
>>> And the "Extractor" remove a point from exists Clusters is indeed
>>> unsuitable
>>> to be a public API(as you mentioned before.)
>>> I think another choice is simple make "EmptyClusterStrategy" and related
>>> logic protected,
>>> then it can be resuse by MiniBatchKMeans.
>>>
>>> Also after the "CentroidInitializer" has been move out, the
>>> "KMeansPlusPlusClusterer"
>>>  should rename to "KMeansClusterer" with a construct parameter
>>> "CentroidInitializer"
>>
>>If you think that it makes sense to define a class hierarchy for
>>"KMeans"-based algorithms, please do so.  But note that "protected"
>>is only slightly better than "public" (from a maintenance POV, it isn't
>>better, since we'll be required to maintain compatibility all the same).
>>Perhaps the base classe(s) can be package-private...
>
> OK, package-private on "EmptyClusterStrategy" and related logic is fine.
>
>>
>>> >
>>> >with something along those lines:
>>> >
>>> >public class ClusterSet<T> {
>>> >    private Set<T> data;
>>> >    private List<List<T>> clusters;
>>> >
>>> >    void remove(T e) {
>>> >        return data.remove(e);
>>> >    }
>>> >
>>> >    public interface Visitor {
>>> >        T visit(List<List<T>> list);
>>> >    }
>>> >}
>>> >
>>> >Key point is separation of concern (selection of element vs
>>> >removal of element).
>>>
>>> I propose the ClusterSet should has more member like "distanceMeasure"
>>>  to store origin clustering parameters,
>>> or  to accelerate or support cluster finding or point finding.
>>> ```java
>>> public class ClusterSet<T> {
>>>     private List<Cluster> clusters;
>>>     private DistanceMeasure measure;
>>>     ...
>>>     public List<Cluster> closestTopN(Clusterable point);
>>
>>How about providing a comparator factory?
>>
>>public <T extends Clusterable> static
>> Comparator<Cluster<T>> getComparator(final Clusterable p, final
>>DistanceMeasure m) {
>>    return new Comparator<>() {
>>        public int compare(Cluster<T> c1, Cluster<T> c2) {
>>            final double d1 = c1.distance(p, m); // "distance" method
>>to be added to the interface (?).
>>            final double d2 = c2.distance(p, m);
>>            return d1 < d2 ? -1 : d1 > d2 ? 1 : 0;
>>        }
>>    }
>>}
>
> How do you think about the "distanceMeasure" to be a memeber of
> "ClusterSet",


I don't see why.  A priori "ClusterSet" represents groupings
of data, and the distance function is a parameter that e.g.
determines whether the grouping is good or bad.

> In my opinion "ClusterSet" should contains all the parameters and result for
> clustering and clusters.

IIRC, Alex mentioned "ClusterResult" (?).  IIUC, this would
indeed collect everything necessary to evaluate the quality
of the clustering (i.e. probably also the "DistanceMeasure"
instance used by the clustering algorithm).

>>
>>Then, if also providing an accessor to the clusters:
>>
>>public List<Cluster<T>> getClusters() {
>>    return Collections.umodifiableList(clusters);
>>}
>>
>>users can do whatever they want, a.o. easily implement the
>>ranking method which you suggest:
>>
>>/** @return clusters in descending order of proximity. */
>>public List<Cluster> closest(Clusterable p, ClusterSet<T> s,
>>DistanceMeasure m) {
>>    final List<Cluster<T>> cList = new ArrayList<>(s.getClusters());
>>    Collections.sort(cList, set.getComparator(p, m));
>>    return cList;
>>}
>>
>>>
>>>     public int pointSize();
>>> }
>>> ```
>>
>>A priori, I'd favour more encapsulation, in order to ensure that
>>multi-thread access can be controlled, e.g. hide the "ClusterSet"
>>data-structures, and provide access through identifiers that are
>>atomically created when clusters are added to the instance, and
>>so on...
>>
>>It would be great if you can come up with a concrete (and fully
>>tested) implementation of "ClusterSet". ;-)
>
> I would like to try, but it is a big change to API, we need more time and
> discussions.

Do you have specific questions?
Otherwise, there were already several ideas and hints for
a more flexible and more efficient design to get going...

Gilles

>>
>>> >
>>> >Of course the devil is in the details (maintain consistency among
>>> >the fields of "ClusterSet", ensure that "Visitor" implementations
>>> >cannot modify the internal state, ...).
>>> >
>>> >> Unless I missed some point, I'd ask again that the API be reviewed
>>> >> *before* implementing several features (such as those "extractors")
>>> >> on top of something that does not look right.
>>> >
>>> >One perspective might be to try and come with a design for use in
>>> >multi-threaded applications (see e.g. package "o.a.c.m.ml.neuralnet").
>>> >
>>> >Best regards,
>>> >Gilles
>>> >
>>> >> >> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [math]Discussion: How to move out "EmptyClusterStrategy" from KMeansPlusPlusClusterer

Reply via email to