Hi,
On Sun, May 1, 2016 at 12:25 AM, Gilles <gil...@harfang.homelinux.org> wrote: Therefore I started to wonder why not to use RealVector > >> instead, since it has sparse implementation so I will be able to leverage >> it. >> > > The principle is fine; but I'm wary to use "RealVector" in new code > since it must be refactored... > > By refactoring, do you mean MATH-765? > Right now using kmeans++ clustering algorithm provided by common.maths >> it's not doable to cluster entire wikipedia dataset or any other huge >> datasets. >> > > Could you expand on this application? What is the data? > > The data is tf-idf matrix of wikipedia data set produced by gensim. I'm trying to run coreset algorithm with Spark streaming and I need kmeans++ for local computation. > > [One of the things to bear in mind while testing new implementations > is to not loose performance on the other classes of problems (or you'll > take heat from some of this list's observers...).] > > Is there any performance tests or regression, so it could be used to validate whenever new implementation doesn't introduced significant regression? Best, Artem. > > Best regards, > Gilles > > > >> >> Best regards, >> Artem Barger. >> >> On Sat, Apr 30, 2016 at 11:41 PM, Gilles <gil...@harfang.homelinux.org> >> wrote: >> >> On Mon, 25 Apr 2016 15:52:03 +0300, Artem Barger wrote: >>> >>> Hi All, >>>> >>>> I'd like to provide a solution for [MATH-1330] issue. Before starting I >>>> have a concerns regarding the possible design and the actual >>>> implementation. >>>> >>>> Currently all implementations of Clusterer interface expect to receive >>>> instance of DistanceMeasure class, which used to compute distance or >>>> metric >>>> between two points. Switching clustering algorithms to work with Vectors >>>> will make this unnecessary, therefore there will be no need to provide >>>> DistanceMeasure, since Vector class already provides methods to compute >>>> vector norms. >>>> >>>> >>> I think that reasons for using "double[]" in the "o.a.c.m.ml.clustering" >>> package were: >>> * simple and straightforward (fixed dimension Cartesian coordinates) >>> * not couple it with the "o.a.c.m.linear" package whose "RealVector" is >>> for variable size sequences of elements (and is also, inconsistently, >>> used as a Cartesian vector, and also as column- and row-matrix[1]) >>> >>> It is arguable adapted for a family of problems which the developer >>> probably had in mind when taking those design decisions. >>> >>> It would be interesting to know for which class of problems, the design >>> is inappropriate, in order to clarify ideas. >>> >>> The main drawback of this approach is that we will loose the ability to >>> >>>> control which metric to use during clustering, however the only classes >>>> which make an implicit use of this parameters are: Clusterer and >>>> KmeansPlusPlusClusterer all others assumes EucledianDistance by default. >>>> >>>> >>> There is a default indeed, but all "Clusterer" implementations use >>> whatever "DistanceMeasure" has been passed to the constructor. >>> >>> Assuming that "RealVector" knows how to compute the distance means that >>> users will have to implement their own subclass of "RealVector" and >>> override "getDistance(RealVector)" if they want another distance. >>> Alternatively, CM would have to define all these classes. >>> >>> At first sight, it does not seem the right way to go... >>> >>> One of the possible approaches is to extend DistanceMeasure interface to >>> be >>> >>>> able to compute distance between two vectors? After all it's only sub >>>> first >>>> vector from the second and compute desired norm on the result. >>>> >>>> >>> Seems good (at first sight) but (IMHO) only if we implement a new >>> "CartesianVector" class unencumbered with all the cruft of "RealVector". >>> >>> Another possible solution is to make vector to return it's coordinates, >>> >>>> hence it avail us to use DistanceMeasure as is. Personally I do not >>>> think >>>> this is good approach, since it will make no sense with sparse vectors. >>>> >>>> >>> Ruled out indeed if it conflicts with your intended usage. >>> >>> Last alternative this comes to my mind is to create a set of enums to >>> >>>> indicate which vector norm to use to compute distances, also do no think >>>> this is very good solution, since sounds too intrusive and might break >>>> backward compatibility. >>>> >>>> >>> And forward compatibility (clustering code will have to be adapted if >>> another distance is added later). >>> >>> What do you think? Am I missing something? Is there a better possible way >>> >>>> to achieve the goal? >>>> >>>> >>> As indicated above, a practical example might help visualize the options. >>> >>> >>> Regards, >>> Gilles >>> >>> [1] Cf. https://issues.apache.org/jira/browse/MATH-765 >>> >>> >>> Best regards, >>>> Artem Barger. >>>> >>>> >>> >>> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > >