Hello. 2020-03-21 4:02 UTC+01:00, chentao...@qq.com <chentao...@qq.com>: > Hi, > >>2020-03-21 2:59 UTC+01:00, chentao...@qq.com <chentao...@qq.com>: >>> Hi, >>> >>>>Le ven. 20 mars 2020 à 04:47, chentao...@qq.com <chentao...@qq.com> a >>>> écrit >>>> : >>>>> >>>>> Hi, >>>>> >>>>> >Hello. >>>>> > >>>>> >Le mer. 18 mars 2020 à 17:57, Gilles Sadowski <gillese...@gmail.com> a >>>>> > écrit : >>>>> >> >>>>> >> Hi. >>>>> >> >>>>> >> 2020-03-18 15:10 UTC+01:00, chentao...@qq.com <chentao...@qq.com>: >>>>> >> > Hi, >>>>> >> > I have created a PR to show my aim: >>>>> >> > https://github.com/apache/commons-math/pull/126/files >>>>> >> >>>>> >> Am I correct that the implementations of "ClustersPointExtractor" >>>>> >> modify the argument of the "extract" method? >>>>> >> If so, that seems quite unsafe. I would not expect this behaviour >>>>> >> in a public API. >>>>> > >>>>> >To be clearer (hopefully): I'd contrast this >>>>> > (incomplete/non-compilable >>>>> >code): >>>>> > >>>>> >public class Extractor { >>>>> > T extract(List<List<T>> list); >>>>> >} >>>>> >>>>> I have read the exists code again, and I found "EmptyClusterStrategy" >>>>> and related logic only used in K-Means. >>>>> And the "Extractor" remove a point from exists Clusters is indeed >>>>> unsuitable >>>>> to be a public API(as you mentioned before.) >>>>> I think another choice is simple make "EmptyClusterStrategy" and >>>>> related >>>>> logic protected, >>>>> then it can be resuse by MiniBatchKMeans. >>>>> >>>>> Also after the "CentroidInitializer" has been move out, the >>>>> "KMeansPlusPlusClusterer" >>>>> should rename to "KMeansClusterer" with a construct parameter >>>>> "CentroidInitializer" >>>> >>>>If you think that it makes sense to define a class hierarchy for >>>>"KMeans"-based algorithms, please do so. But note that "protected" >>>>is only slightly better than "public" (from a maintenance POV, it isn't >>>>better, since we'll be required to maintain compatibility all the same). >>>>Perhaps the base classe(s) can be package-private... >>> >>> OK, package-private on "EmptyClusterStrategy" and related logic is fine. >>> >>>> >>>>> > >>>>> >with something along those lines: >>>>> > >>>>> >public class ClusterSet<T> { >>>>> > private Set<T> data; >>>>> > private List<List<T>> clusters; >>>>> > >>>>> > void remove(T e) { >>>>> > return data.remove(e); >>>>> > } >>>>> > >>>>> > public interface Visitor { >>>>> > T visit(List<List<T>> list); >>>>> > } >>>>> >} >>>>> > >>>>> >Key point is separation of concern (selection of element vs >>>>> >removal of element). >>>>> >>>>> I propose the ClusterSet should has more member like "distanceMeasure" >>>>> to store origin clustering parameters, >>>>> or to accelerate or support cluster finding or point finding. >>>>> ```java >>>>> public class ClusterSet<T> { >>>>> private List<Cluster> clusters; >>>>> private DistanceMeasure measure; >>>>> ... >>>>> public List<Cluster> closestTopN(Clusterable point); >>>> >>>>How about providing a comparator factory? >>>> >>>>public <T extends Clusterable> static >>>> Comparator<Cluster<T>> getComparator(final Clusterable p, final >>>>DistanceMeasure m) { >>>> return new Comparator<>() { >>>> public int compare(Cluster<T> c1, Cluster<T> c2) { >>>> final double d1 = c1.distance(p, m); // "distance" method >>>>to be added to the interface (?). >>>> final double d2 = c2.distance(p, m); >>>> return d1 < d2 ? -1 : d1 > d2 ? 1 : 0; >>>> } >>>> } >>>>} >>> >>> How do you think about the "distanceMeasure" to be a memeber of >>> "ClusterSet", >> >>I don't see why. A priori "ClusterSet" represents groupings >>of data, and the distance function is a parameter that e.g. >>determines whether the grouping is good or bad. >> >>> In my opinion "ClusterSet" should contains all the parameters and result >>> for >>> clustering and clusters. >> >>IIRC, Alex mentioned "ClusterResult" (?). IIUC, this would >>indeed collect everything necessary to evaluate the quality >>of the clustering (i.e. probably also the "DistanceMeasure" >>instance used by the clustering algorithm). > > My understanding is "ClusterSet" and "ClusterResult" is one thing.
I don't think so. IMO, the idea is that "ClusterSet" will hold data-structures like "List<Cluster<Clusterable>>" together with methods to manipulate individual clusters (in a thread-safe way). While "ClusterResult" would be the return type of a call to the "cluster(Collection<Clusterable>)" method of a clustering algorithm implementation, and would contain a "ClusterSet" instance together with any additional fields deemed necessary to analyze the result. > Did I missed something? I hope that I didn't, and that the above paragraph dispelled the confusion. Regards, Gilles >> >>>> >>>>Then, if also providing an accessor to the clusters: >>>> >>>>public List<Cluster<T>> getClusters() { >>>> return Collections.umodifiableList(clusters); >>>>} >>>> >>>>users can do whatever they want, a.o. easily implement the >>>>ranking method which you suggest: >>>> >>>>/** @return clusters in descending order of proximity. */ >>>>public List<Cluster> closest(Clusterable p, ClusterSet<T> s, >>>>DistanceMeasure m) { >>>> final List<Cluster<T>> cList = new ArrayList<>(s.getClusters()); >>>> Collections.sort(cList, set.getComparator(p, m)); >>>> return cList; >>>>} >>>> >>>>> >>>>> public int pointSize(); >>>>> } >>>>> ``` >>>> >>>>A priori, I'd favour more encapsulation, in order to ensure that >>>>multi-thread access can be controlled, e.g. hide the "ClusterSet" >>>>data-structures, and provide access through identifiers that are >>>>atomically created when clusters are added to the instance, and >>>>so on... >>>> >>>>It would be great if you can come up with a concrete (and fully >>>>tested) implementation of "ClusterSet". ;-) >>> >>> I would like to try, but it is a big change to API, we need more time and >>> discussions. >> >>Do you have specific questions? >>Otherwise, there were already several ideas and hints for >>a more flexible and more efficient design to get going... >> >>Gilles >> >>>> >>>>> > >>>>> >Of course the devil is in the details (maintain consistency among >>>>> >the fields of "ClusterSet", ensure that "Visitor" implementations >>>>> >cannot modify the internal state, ...). >>>>> > >>>>> >> Unless I missed some point, I'd ask again that the API be reviewed >>>>> >> *before* implementing several features (such as those "extractors") >>>>> >> on top of something that does not look right. >>>>> > >>>>> >One perspective might be to try and come with a design for use in >>>>> >multi-threaded applications (see e.g. package "o.a.c.m.ml.neuralnet"). >>>>> > >>>>> >Best regards, >>>>> >Gilles >>>>> > >>>>> >> >> [...] >> >>--------------------------------------------------------------------- >>To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>For additional commands, e-mail: dev-h...@commons.apache.org >> >> --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org