Hi, >2020-03-21 2:59 UTC+01:00, chentao...@qq.com <chentao...@qq.com>: >> Hi, >> >>>Le ven. 20 mars 2020 à 04:47, chentao...@qq.com <chentao...@qq.com> a écrit >>> : >>>> >>>> Hi, >>>> >>>> >Hello. >>>> > >>>> >Le mer. 18 mars 2020 à 17:57, Gilles Sadowski <gillese...@gmail.com> a >>>> > écrit : >>>> >> >>>> >> Hi. >>>> >> >>>> >> 2020-03-18 15:10 UTC+01:00, chentao...@qq.com <chentao...@qq.com>: >>>> >> > Hi, >>>> >> > I have created a PR to show my aim: >>>> >> > https://github.com/apache/commons-math/pull/126/files >>>> >> >>>> >> Am I correct that the implementations of "ClustersPointExtractor" >>>> >> modify the argument of the "extract" method? >>>> >> If so, that seems quite unsafe. I would not expect this behaviour >>>> >> in a public API. >>>> > >>>> >To be clearer (hopefully): I'd contrast this (incomplete/non-compilable >>>> >code): >>>> > >>>> >public class Extractor { >>>> > T extract(List<List<T>> list); >>>> >} >>>> >>>> I have read the exists code again, and I found "EmptyClusterStrategy" >>>> and related logic only used in K-Means. >>>> And the "Extractor" remove a point from exists Clusters is indeed >>>> unsuitable >>>> to be a public API(as you mentioned before.) >>>> I think another choice is simple make "EmptyClusterStrategy" and related >>>> logic protected, >>>> then it can be resuse by MiniBatchKMeans. >>>> >>>> Also after the "CentroidInitializer" has been move out, the >>>> "KMeansPlusPlusClusterer" >>>> should rename to "KMeansClusterer" with a construct parameter >>>> "CentroidInitializer" >>> >>>If you think that it makes sense to define a class hierarchy for >>>"KMeans"-based algorithms, please do so. But note that "protected" >>>is only slightly better than "public" (from a maintenance POV, it isn't >>>better, since we'll be required to maintain compatibility all the same). >>>Perhaps the base classe(s) can be package-private... >> >> OK, package-private on "EmptyClusterStrategy" and related logic is fine. >> >>> >>>> > >>>> >with something along those lines: >>>> > >>>> >public class ClusterSet<T> { >>>> > private Set<T> data; >>>> > private List<List<T>> clusters; >>>> > >>>> > void remove(T e) { >>>> > return data.remove(e); >>>> > } >>>> > >>>> > public interface Visitor { >>>> > T visit(List<List<T>> list); >>>> > } >>>> >} >>>> > >>>> >Key point is separation of concern (selection of element vs >>>> >removal of element). >>>> >>>> I propose the ClusterSet should has more member like "distanceMeasure" >>>> to store origin clustering parameters, >>>> or to accelerate or support cluster finding or point finding. >>>> ```java >>>> public class ClusterSet<T> { >>>> private List<Cluster> clusters; >>>> private DistanceMeasure measure; >>>> ... >>>> public List<Cluster> closestTopN(Clusterable point); >>> >>>How about providing a comparator factory? >>> >>>public <T extends Clusterable> static >>> Comparator<Cluster<T>> getComparator(final Clusterable p, final >>>DistanceMeasure m) { >>> return new Comparator<>() { >>> public int compare(Cluster<T> c1, Cluster<T> c2) { >>> final double d1 = c1.distance(p, m); // "distance" method >>>to be added to the interface (?). >>> final double d2 = c2.distance(p, m); >>> return d1 < d2 ? -1 : d1 > d2 ? 1 : 0; >>> } >>> } >>>} >> >> How do you think about the "distanceMeasure" to be a memeber of >> "ClusterSet", > >I don't see why. A priori "ClusterSet" represents groupings >of data, and the distance function is a parameter that e.g. >determines whether the grouping is good or bad. > >> In my opinion "ClusterSet" should contains all the parameters and result for >> clustering and clusters. > >IIRC, Alex mentioned "ClusterResult" (?). IIUC, this would >indeed collect everything necessary to evaluate the quality >of the clustering (i.e. probably also the "DistanceMeasure" >instance used by the clustering algorithm).
My understanding is "ClusterSet" and "ClusterResult" is one thing. Did I missed something? > >>> >>>Then, if also providing an accessor to the clusters: >>> >>>public List<Cluster<T>> getClusters() { >>> return Collections.umodifiableList(clusters); >>>} >>> >>>users can do whatever they want, a.o. easily implement the >>>ranking method which you suggest: >>> >>>/** @return clusters in descending order of proximity. */ >>>public List<Cluster> closest(Clusterable p, ClusterSet<T> s, >>>DistanceMeasure m) { >>> final List<Cluster<T>> cList = new ArrayList<>(s.getClusters()); >>> Collections.sort(cList, set.getComparator(p, m)); >>> return cList; >>>} >>> >>>> >>>> public int pointSize(); >>>> } >>>> ``` >>> >>>A priori, I'd favour more encapsulation, in order to ensure that >>>multi-thread access can be controlled, e.g. hide the "ClusterSet" >>>data-structures, and provide access through identifiers that are >>>atomically created when clusters are added to the instance, and >>>so on... >>> >>>It would be great if you can come up with a concrete (and fully >>>tested) implementation of "ClusterSet". ;-) >> >> I would like to try, but it is a big change to API, we need more time and >> discussions. > >Do you have specific questions? >Otherwise, there were already several ideas and hints for >a more flexible and more efficient design to get going... > >Gilles > >>> >>>> > >>>> >Of course the devil is in the details (maintain consistency among >>>> >the fields of "ClusterSet", ensure that "Visitor" implementations >>>> >cannot modify the internal state, ...). >>>> > >>>> >> Unless I missed some point, I'd ask again that the API be reviewed >>>> >> *before* implementing several features (such as those "extractors") >>>> >> on top of something that does not look right. >>>> > >>>> >One perspective might be to try and come with a design for use in >>>> >multi-threaded applications (see e.g. package "o.a.c.m.ml.neuralnet"). >>>> > >>>> >Best regards, >>>> >Gilles >>>> > >>>> >> >> [...] > >--------------------------------------------------------------------- >To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >For additional commands, e-mail: dev-h...@commons.apache.org > >