Re: [math]Discussion: How to move out "EmptyClusterStrategy" from KMeansPlusPlusClusterer

Gilles Sadowski Sat, 21 Mar 2020 05:16:08 -0700

Hello.

2020-03-21 4:02 UTC+01:00, chentao...@qq.com <chentao...@qq.com>:
> Hi,
>
>>2020-03-21 2:59 UTC+01:00, chentao...@qq.com <chentao...@qq.com>:
>>> Hi,
>>>
>>>>Le ven. 20 mars 2020 à 04:47, chentao...@qq.com <chentao...@qq.com> a
>>>> écrit
>>>> :
>>>>>
>>>>> Hi,
>>>>>
>>>>> >Hello.
>>>>> >
>>>>> >Le mer. 18 mars 2020 à 17:57, Gilles Sadowski <gillese...@gmail.com> a
>>>>> > écrit :
>>>>> >>
>>>>> >> Hi.
>>>>> >>
>>>>> >> 2020-03-18 15:10 UTC+01:00, chentao...@qq.com <chentao...@qq.com>:
>>>>> >> > Hi,
>>>>> >> >     I have created a PR to show my aim:
>>>>> >> > https://github.com/apache/commons-math/pull/126/files
>>>>> >>
>>>>> >> Am I correct that the implementations of "ClustersPointExtractor"
>>>>> >> modify the argument of the "extract" method?
>>>>> >> If so, that seems quite unsafe.  I would not expect this behaviour
>>>>> >> in a public API.
>>>>> >
>>>>> >To be clearer (hopefully): I'd contrast this
>>>>> > (incomplete/non-compilable
>>>>> >code):
>>>>> >
>>>>> >public class Extractor {
>>>>> >    T extract(List<List<T>> list);
>>>>> >}
>>>>>
>>>>> I have read the exists code again, and I found "EmptyClusterStrategy"
>>>>>  and related logic only used in K-Means.
>>>>> And the "Extractor" remove a point from exists Clusters is indeed
>>>>> unsuitable
>>>>> to be a public API(as you mentioned before.)
>>>>> I think another choice is simple make "EmptyClusterStrategy" and
>>>>> related
>>>>> logic protected,
>>>>> then it can be resuse by MiniBatchKMeans.
>>>>>
>>>>> Also after the "CentroidInitializer" has been move out, the
>>>>> "KMeansPlusPlusClusterer"
>>>>>  should rename to "KMeansClusterer" with a construct parameter
>>>>> "CentroidInitializer"
>>>>
>>>>If you think that it makes sense to define a class hierarchy for
>>>>"KMeans"-based algorithms, please do so.  But note that "protected"
>>>>is only slightly better than "public" (from a maintenance POV, it isn't
>>>>better, since we'll be required to maintain compatibility all the same).
>>>>Perhaps the base classe(s) can be package-private...
>>>
>>> OK, package-private on "EmptyClusterStrategy" and related logic is fine.
>>>
>>>>
>>>>> >
>>>>> >with something along those lines:
>>>>> >
>>>>> >public class ClusterSet<T> {
>>>>> >    private Set<T> data;
>>>>> >    private List<List<T>> clusters;
>>>>> >
>>>>> >    void remove(T e) {
>>>>> >        return data.remove(e);
>>>>> >    }
>>>>> >
>>>>> >    public interface Visitor {
>>>>> >        T visit(List<List<T>> list);
>>>>> >    }
>>>>> >}
>>>>> >
>>>>> >Key point is separation of concern (selection of element vs
>>>>> >removal of element).
>>>>>
>>>>> I propose the ClusterSet should has more member like "distanceMeasure"
>>>>>  to store origin clustering parameters,
>>>>> or  to accelerate or support cluster finding or point finding.
>>>>> ```java
>>>>> public class ClusterSet<T> {
>>>>>     private List<Cluster> clusters;
>>>>>     private DistanceMeasure measure;
>>>>>     ...
>>>>>     public List<Cluster> closestTopN(Clusterable point);
>>>>
>>>>How about providing a comparator factory?
>>>>
>>>>public <T extends Clusterable> static
>>>> Comparator<Cluster<T>> getComparator(final Clusterable p, final
>>>>DistanceMeasure m) {
>>>>    return new Comparator<>() {
>>>>        public int compare(Cluster<T> c1, Cluster<T> c2) {
>>>>            final double d1 = c1.distance(p, m); // "distance" method
>>>>to be added to the interface (?).
>>>>            final double d2 = c2.distance(p, m);
>>>>            return d1 < d2 ? -1 : d1 > d2 ? 1 : 0;
>>>>        }
>>>>    }
>>>>}
>>>
>>> How do you think about the "distanceMeasure" to be a memeber of
>>> "ClusterSet",
>>
>>I don't see why.  A priori "ClusterSet" represents groupings
>>of data, and the distance function is a parameter that e.g.
>>determines whether the grouping is good or bad.
>>
>>> In my opinion "ClusterSet" should contains all the parameters and result
>>> for
>>> clustering and clusters.
>>
>>IIRC, Alex mentioned "ClusterResult" (?).  IIUC, this would
>>indeed collect everything necessary to evaluate the quality
>>of the clustering (i.e. probably also the "DistanceMeasure"
>>instance used by the clustering algorithm).
>
> My understanding is "ClusterSet" and "ClusterResult" is one thing.


I don't think so.
IMO, the idea is that "ClusterSet" will hold data-structures
like "List<Cluster<Clusterable>>" together with methods
to manipulate individual clusters (in a thread-safe way).
While "ClusterResult" would be the return type of a call
to the "cluster(Collection<Clusterable>)" method of a
clustering algorithm implementation, and would contain
a "ClusterSet" instance together with any additional fields
deemed necessary to analyze the result.

> Did I missed something?

I hope that I didn't, and that the above paragraph dispelled
the confusion.

Regards,
Gilles

>>
>>>>
>>>>Then, if also providing an accessor to the clusters:
>>>>
>>>>public List<Cluster<T>> getClusters() {
>>>>    return Collections.umodifiableList(clusters);
>>>>}
>>>>
>>>>users can do whatever they want, a.o. easily implement the
>>>>ranking method which you suggest:
>>>>
>>>>/** @return clusters in descending order of proximity. */
>>>>public List<Cluster> closest(Clusterable p, ClusterSet<T> s,
>>>>DistanceMeasure m) {
>>>>    final List<Cluster<T>> cList = new ArrayList<>(s.getClusters());
>>>>    Collections.sort(cList, set.getComparator(p, m));
>>>>    return cList;
>>>>}
>>>>
>>>>>
>>>>>     public int pointSize();
>>>>> }
>>>>> ```
>>>>
>>>>A priori, I'd favour more encapsulation, in order to ensure that
>>>>multi-thread access can be controlled, e.g. hide the "ClusterSet"
>>>>data-structures, and provide access through identifiers that are
>>>>atomically created when clusters are added to the instance, and
>>>>so on...
>>>>
>>>>It would be great if you can come up with a concrete (and fully
>>>>tested) implementation of "ClusterSet". ;-)
>>>
>>> I would like to try, but it is a big change to API, we need more time and
>>> discussions.
>>
>>Do you have specific questions?
>>Otherwise, there were already several ideas and hints for
>>a more flexible and more efficient design to get going...
>>
>>Gilles
>>
>>>>
>>>>> >
>>>>> >Of course the devil is in the details (maintain consistency among
>>>>> >the fields of "ClusterSet", ensure that "Visitor" implementations
>>>>> >cannot modify the internal state, ...).
>>>>> >
>>>>> >> Unless I missed some point, I'd ask again that the API be reviewed
>>>>> >> *before* implementing several features (such as those "extractors")
>>>>> >> on top of something that does not look right.
>>>>> >
>>>>> >One perspective might be to try and come with a design for use in
>>>>> >multi-threaded applications (see e.g. package "o.a.c.m.ml.neuralnet").
>>>>> >
>>>>> >Best regards,
>>>>> >Gilles
>>>>> >
>>>>> >> >> [...]
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>For additional commands, e-mail: dev-h...@commons.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math]Discussion: How to move out "EmptyClusterStrategy" from KMeansPlusPlusClusterer

Reply via email to