Re: [math]: [MATH-1330] - KMeans clustering algorithm, doesn't support clustering of sparse input data.

Gilles Wed, 04 May 2016 08:18:28 -0700

Hello.

On Sun, 1 May 2016 02:57:59 +0300, Artem Barger wrote:

Hi,
On Sun, May 1, 2016 at 12:25 AM, Gilles<gil...@harfang.homelinux.org>
wrote:

Therefore I started to wonder why not to use RealVector
instead, since it has sparse implementation so I will be able toleverage
it.
The principle is fine; but I'm wary to use "RealVector" in new code
since it must be refactored...
By refactoring, do you mean MATH-765?


Yes.

Right now using kmeans++ clustering algorithm provided bycommon.maths
it's not doable to cluster entire wikipedia dataset or any otherhuge
datasets.
Could you expand on this application?  What is the data?
The data is tf-idf matrix of wikipedia data set produced by gensim.I'm
trying to run coreset algorithm with Spark streaming
and I need kmeans++ for local computation.
[One of the things to bear in mind while testing new implementations
is to not loose performance on the other classes of problems (oryou'll
take heat from some of this list's observers...).]
Is there any performance tests or regression, so it could be used to
validate whenever new implementation doesn't introduced
significant regression?


Not really.

There is a class "PertTestUtils" (in the "test" part of the source code
repository) that could compare execution time between alternative
implementations of some functionality.

I think that it's reliable enough to detect significant performanceissues.


Otherwise, there is the (now "standard"?) JMH benchmarking tool.

Regards,
Gilles

On Sat, Apr 30, 2016 at 11:41 PM, Gilles<gil...@harfang.homelinux.org>
wrote:

On Mon, 25 Apr 2016 15:52:03 +0300, Artem Barger wrote:
Hi All,
I'd like to provide a solution for [MATH-1330] issue. Beforestarting I
have a concerns regarding the possible design and the actual
implementation.
Currently all implementations of Clusterer interface expect toreceiveinstance of DistanceMeasure class, which used to compute distanceor
metric
between two points. Switching clustering algorithms to work withVectorswill make this unnecessary, therefore there will be no need toprovideDistanceMeasure, since Vector class already provides methods tocompute
vector norms.
I think that reasons for using "double[]" in the"o.a.c.m.ml.clustering"
package were:
* simple and straightforward (fixed dimension Cartesiancoordinates)* not couple it with the "o.a.c.m.linear" package whose"RealVector" isfor variable size sequences of elements (and is also,inconsistently,used as a Cartesian vector, and also as column- androw-matrix[1])
It is arguable adapted for a family of problems which thedeveloper
probably had in mind when taking those design decisions.
It would be interesting to know for which class of problems, thedesign
is inappropriate, in order to clarify ideas.
The main drawback of this approach is that we will loose theability to
control which metric to use during clustering, however the onlyclasses
which make an implicit use of this parameters are: Clusterer and
KmeansPlusPlusClusterer all others assumes EucledianDistance bydefault.
There is a default indeed, but all "Clusterer" implementations use
whatever "DistanceMeasure" has been passed to the constructor.
Assuming that "RealVector" knows how to compute the distance meansthatusers will have to implement their own subclass of "RealVector"and
override "getDistance(RealVector)" if they want another distance.
Alternatively, CM would have to define all these classes.

At first sight, it does not seem the right way to go...
One of the possible approaches is to extend DistanceMeasureinterface to
be
able to compute distance between two vectors? After all it's onlysub
first
vector from the second and compute desired norm on the result.
Seems good (at first sight) but (IMHO) only if we implement a new
"CartesianVector" class unencumbered with all the cruft of"RealVector".
Another possible solution is to make vector to return it'scoordinates,
hence it avail us to use DistanceMeasure as is. Personally I donot
think
this is good approach, since it will make no sense with sparsevectors.
Ruled out indeed if it conflicts with your intended usage.
Last alternative this comes to my mind is to create a set of enumsto
indicate which vector norm to use to compute distances, also dono thinkthis is very good solution, since sounds too intrusive and mightbreak
backward compatibility.
And forward compatibility (clustering code will have to be adaptedif
another distance is added later).
What do you think? Am I missing something? Is there a betterpossible way
to achieve the goal?
As indicated above, a practical example might help visualize theoptions.
Regards,
Gilles

[1] Cf. https://issues.apache.org/jira/browse/MATH-765


Best regards,
                      Artem Barger.



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math]: [MATH-1330] - KMeans clustering algorithm, doesn't support clustering of sparse input data.

Reply via email to