On Mon, 25 Apr 2016 15:52:03 +0300, Artem Barger wrote:
Hi All,

I'd like to provide a solution for the [MATH-1330] issue. Before starting, I have some concerns regarding the possible design and the actual implementation.

Currently all implementations of the Clusterer interface expect to receive an
instance of the DistanceMeasure class, which is used to compute the distance
(metric) between two points. Switching the clustering algorithms to work with
vectors would make this unnecessary: there would be no need to provide a
DistanceMeasure, since the vector class already provides methods to compute
vector norms.
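
For illustration, this is roughly how the current API is used (a minimal
sketch against the org.apache.commons.math3 packages; the sample data is
made up):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.commons.math3.ml.clustering.CentroidCluster;
    import org.apache.commons.math3.ml.clustering.DoublePoint;
    import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;
    import org.apache.commons.math3.ml.distance.ManhattanDistance;

    // Points are plain double[] wrapped in DoublePoint (a Clusterable).
    List<DoublePoint> points = Arrays.asList(
        new DoublePoint(new double[] { 0.0, 0.1 }),
        new DoublePoint(new double[] { 0.2, 0.0 }),
        new DoublePoint(new double[] { 5.0, 5.1 }),
        new DoublePoint(new double[] { 5.2, 4.9 }));

    // The metric is injected through the constructor
    // (EuclideanDistance is used when none is given).
    KMeansPlusPlusClusterer<DoublePoint> clusterer =
        new KMeansPlusPlusClusterer<DoublePoint>(2, 100, new ManhattanDistance());
    List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);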

I think that the reasons for using "double[]" in the "o.a.c.m.ml.clustering"
package were:
 * simple and straightforward (fixed-dimension Cartesian coordinates),
 * not coupling it with the "o.a.c.m.linear" package, whose "RealVector" is
   meant for variable-size sequences of elements (and is also, inconsistently,
   used as a Cartesian vector, and as a column- and row-matrix [1]).

It is arguably adapted to the family of problems which the developer
probably had in mind when taking those design decisions.

It would be interesting to know for which class of problems the design
is inappropriate, in order to clarify ideas.

The main drawback of this approach is that we would lose the ability to
control which metric is used during clustering. However, the only classes
that make implicit use of this parameter are Clusterer and
KMeansPlusPlusClusterer; all others assume EuclideanDistance by default.

There is a default indeed, but all "Clusterer" implementations use
whatever "DistanceMeasure" has been passed to the constructor.

Assuming that "RealVector" knows how to compute the distance means that
users will have to implement their own subclass of "RealVector" and
override "getDistance(RealVector)" if they want another distance.
Alternatively, CM would have to define all these classes.
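
For instance, something along these lines (purely illustrative; the
"ManhattanArrayRealVector" name is made up, and it assumes "RealVector"
and "ArrayRealVector" from "o.a.c.m.linear"):

    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.RealVector;

    // Hypothetical user code: every alternative metric would require
    // a dedicated vector subclass like this one.
    public class ManhattanArrayRealVector extends ArrayRealVector {
        public ManhattanArrayRealVector(double[] data) {
            super(data);
        }

        @Override
        public double getDistance(RealVector v) {
            // Interpret "distance" as the L1 distance instead of L2.
            return getL1Distance(v);
        }
    }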

At first sight, it does not seem the right way to go...

One of the possible approaches is to extend the DistanceMeasure interface
so that it can compute the distance between two vectors. After all, it is
only a matter of subtracting the first vector from the second and computing
the desired norm of the result.
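
A rough sketch of what such an extension could look like (the
"VectorDistanceMeasure" and "EuclideanVectorDistance" names are made up,
and it assumes "RealVector" from "o.a.c.m.linear"):

    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.RealVector;
    import org.apache.commons.math3.ml.distance.DistanceMeasure;

    // Hypothetical extension: a distance between two vectors, in addition
    // to the existing double[]-based contract.
    public interface VectorDistanceMeasure extends DistanceMeasure {
        double compute(RealVector a, RealVector b);
    }

    // Example implementation: subtract the first vector from the second
    // and take the L2 norm of the result.
    class EuclideanVectorDistance implements VectorDistanceMeasure {
        public double compute(RealVector a, RealVector b) {
            return b.subtract(a).getNorm();
        }

        public double compute(double[] a, double[] b) {
            // Delegate the array-based contract to the vector-based one.
            return compute(new ArrayRealVector(a, false), new ArrayRealVector(b, false));
        }
    }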

Seems good (at first sight) but (IMHO) only if we implement a new
"CartesianVector" class unencumbered with all the cruft of "RealVector".

Another possible solution is to make the vector return its coordinates,
which would allow us to use DistanceMeasure as is. Personally, I do not
think this is a good approach, since it makes no sense for sparse vectors.
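
That would amount to something like the following (assuming "RealVector"
and its "toArray()" method):

    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.RealVector;
    import org.apache.commons.math3.ml.distance.DistanceMeasure;
    import org.apache.commons.math3.ml.distance.EuclideanDistance;

    RealVector v1 = new ArrayRealVector(new double[] { 1.0, 2.0, 3.0 });
    RealVector v2 = new ArrayRealVector(new double[] { 4.0, 5.0, 6.0 });

    // Materialize the coordinates so the existing double[]-based measure
    // can be reused. For a sparse vector this densifies every entry,
    // which defeats the purpose of using a sparse representation.
    DistanceMeasure measure = new EuclideanDistance();
    double d = measure.compute(v1.toArray(), v2.toArray());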

Ruled out indeed if it conflicts with your intended usage.

The last alternative that comes to my mind is to create a set of enums to
indicate which vector norm to use to compute distances. I do not think this
is a very good solution either, since it sounds too intrusive and might
break backward compatibility.
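
For completeness, the kind of thing I had in mind (a hypothetical
"VectorNorm" enum, not an existing CM type; it assumes "RealVector" from
"o.a.c.m.linear"):

    import org.apache.commons.math3.linear.RealVector;

    // Hypothetical enum selecting which norm to apply to the difference
    // of two vectors.
    public enum VectorNorm {
        L1 {
            public double distance(RealVector a, RealVector b) {
                return a.getL1Distance(b);
            }
        },
        L2 {
            public double distance(RealVector a, RealVector b) {
                return a.getDistance(b);
            }
        },
        LINF {
            public double distance(RealVector a, RealVector b) {
                return a.getLInfDistance(b);
            }
        };

        public abstract double distance(RealVector a, RealVector b);
    }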

And forward compatibility (clustering code will have to be adapted if
another distance is added later).

What do you think? Am I missing something? Is there a better possible way
to achieve the goal?

As indicated above, a practical example might help visualize the options.


Regards,
Gilles

[1] Cf. https://issues.apache.org/jira/browse/MATH-765


Best regards,
                      Artem Barger.

