Re: [math]Discussion: How to move out "EmptyClusterStrategy" from KMeansPlusPlusClusterer

chentao...@qq.com Fri, 20 Mar 2020 20:02:44 -0700

Hi,

>2020-03-21 2:59 UTC+01:00, chentao...@qq.com <chentao...@qq.com>:
>> Hi,
>>
>>>Le ven. 20 mars 2020 à 04:47, chentao...@qq.com <chentao...@qq.com> a écrit
>>> :
>>>>
>>>> Hi,
>>>>
>>>> >Hello.
>>>> >
>>>> >Le mer. 18 mars 2020 à 17:57, Gilles Sadowski <gillese...@gmail.com> a
>>>> > écrit :
>>>> >>
>>>> >> Hi.
>>>> >>
>>>> >> 2020-03-18 15:10 UTC+01:00, chentao...@qq.com <chentao...@qq.com>:
>>>> >> > Hi,
>>>> >> >     I have created a PR to show my aim:
>>>> >> > https://github.com/apache/commons-math/pull/126/files
>>>> >>
>>>> >> Am I correct that the implementations of "ClustersPointExtractor"
>>>> >> modify the argument of the "extract" method?
>>>> >> If so, that seems quite unsafe.  I would not expect this behaviour
>>>> >> in a public API.
>>>> >
>>>> >To be clearer (hopefully): I'd contrast this (incomplete/non-compilable
>>>> >code):
>>>> >
>>>> >public class Extractor {
>>>> >    T extract(List<List<T>> list);
>>>> >}
>>>>
>>>> I have read the exists code again, and I found "EmptyClusterStrategy"
>>>>  and related logic only used in K-Means.
>>>> And the "Extractor" remove a point from exists Clusters is indeed
>>>> unsuitable
>>>> to be a public API(as you mentioned before.)
>>>> I think another choice is simple make "EmptyClusterStrategy" and related
>>>> logic protected,
>>>> then it can be resuse by MiniBatchKMeans.
>>>>
>>>> Also after the "CentroidInitializer" has been move out, the
>>>> "KMeansPlusPlusClusterer"
>>>>  should rename to "KMeansClusterer" with a construct parameter
>>>> "CentroidInitializer"
>>>
>>>If you think that it makes sense to define a class hierarchy for
>>>"KMeans"-based algorithms, please do so.  But note that "protected"
>>>is only slightly better than "public" (from a maintenance POV, it isn't
>>>better, since we'll be required to maintain compatibility all the same).
>>>Perhaps the base classe(s) can be package-private...
>>
>> OK, package-private on "EmptyClusterStrategy" and related logic is fine.
>>
>>>
>>>> >
>>>> >with something along those lines:
>>>> >
>>>> >public class ClusterSet<T> {
>>>> >    private Set<T> data;
>>>> >    private List<List<T>> clusters;
>>>> >
>>>> >    void remove(T e) {
>>>> >        return data.remove(e);
>>>> >    }
>>>> >
>>>> >    public interface Visitor {
>>>> >        T visit(List<List<T>> list);
>>>> >    }
>>>> >}
>>>> >
>>>> >Key point is separation of concern (selection of element vs
>>>> >removal of element).
>>>>
>>>> I propose the ClusterSet should has more member like "distanceMeasure"
>>>>  to store origin clustering parameters,
>>>> or  to accelerate or support cluster finding or point finding.
>>>> ```java
>>>> public class ClusterSet<T> {
>>>>     private List<Cluster> clusters;
>>>>     private DistanceMeasure measure;
>>>>     ...
>>>>     public List<Cluster> closestTopN(Clusterable point);
>>>
>>>How about providing a comparator factory?
>>>
>>>public <T extends Clusterable> static
>>> Comparator<Cluster<T>> getComparator(final Clusterable p, final
>>>DistanceMeasure m) {
>>>    return new Comparator<>() {
>>>        public int compare(Cluster<T> c1, Cluster<T> c2) {
>>>            final double d1 = c1.distance(p, m); // "distance" method
>>>to be added to the interface (?).
>>>            final double d2 = c2.distance(p, m);
>>>            return d1 < d2 ? -1 : d1 > d2 ? 1 : 0;
>>>        }
>>>    }
>>>}
>>
>> How do you think about the "distanceMeasure" to be a memeber of
>> "ClusterSet",
>
>I don't see why.  A priori "ClusterSet" represents groupings
>of data, and the distance function is a parameter that e.g.
>determines whether the grouping is good or bad.
>
>> In my opinion "ClusterSet" should contains all the parameters and result for
>> clustering and clusters.
>
>IIRC, Alex mentioned "ClusterResult" (?).  IIUC, this would
>indeed collect everything necessary to evaluate the quality
>of the clustering (i.e. probably also the "DistanceMeasure"
>instance used by the clustering algorithm).


My understanding is "ClusterSet" and "ClusterResult" is one thing.
Did I missed something?

>
>>>
>>>Then, if also providing an accessor to the clusters:
>>>
>>>public List<Cluster<T>> getClusters() {
>>>    return Collections.umodifiableList(clusters);
>>>}
>>>
>>>users can do whatever they want, a.o. easily implement the
>>>ranking method which you suggest:
>>>
>>>/** @return clusters in descending order of proximity. */
>>>public List<Cluster> closest(Clusterable p, ClusterSet<T> s,
>>>DistanceMeasure m) {
>>>    final List<Cluster<T>> cList = new ArrayList<>(s.getClusters());
>>>    Collections.sort(cList, set.getComparator(p, m));
>>>    return cList;
>>>}
>>>
>>>>
>>>>     public int pointSize();
>>>> }
>>>> ```
>>>
>>>A priori, I'd favour more encapsulation, in order to ensure that
>>>multi-thread access can be controlled, e.g. hide the "ClusterSet"
>>>data-structures, and provide access through identifiers that are
>>>atomically created when clusters are added to the instance, and
>>>so on...
>>>
>>>It would be great if you can come up with a concrete (and fully
>>>tested) implementation of "ClusterSet". ;-)
>>
>> I would like to try, but it is a big change to API, we need more time and
>> discussions.
>
>Do you have specific questions?
>Otherwise, there were already several ideas and hints for
>a more flexible and more efficient design to get going...
>
>Gilles
>
>>>
>>>> >
>>>> >Of course the devil is in the details (maintain consistency among
>>>> >the fields of "ClusterSet", ensure that "Visitor" implementations
>>>> >cannot modify the internal state, ...).
>>>> >
>>>> >> Unless I missed some point, I'd ask again that the API be reviewed
>>>> >> *before* implementing several features (such as those "extractors")
>>>> >> on top of something that does not look right.
>>>> >
>>>> >One perspective might be to try and come with a design for use in
>>>> >multi-threaded applications (see e.g. package "o.a.c.m.ml.neuralnet").
>>>> >
>>>> >Best regards,
>>>> >Gilles
>>>> >
>>>> >> >> [...]
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>For additional commands, e-mail: dev-h...@commons.apache.org
>
>

Re: [math]Discussion: How to move out "EmptyClusterStrategy" from KMeansPlusPlusClusterer

Reply via email to