Greetings, commons-math!

  I've been looking at a variety of apache/bsd-licensed linear libraries for
use in massively parallel machine-learning applications I've been working on
(I am housing my own open-source library at http://decomposer.googlecode.com,
and am looking at integrating with/using/contributing to Apache Mahout), and
I'm wondering a little about the linear API there is here in commons-math:

  * for RealVector - no iterator methods?  So if the implementation is
sparse, there's no way to iterate over just the non-zero entries?  What's
worse, you can't even subclass OpenMapVector and expose the iterator on the
OpenIntToDoubleHashMap inner object, because it's private. :\
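To make the suggestion concrete, here's a rough sketch of the kind of entry iterator a sparse vector could expose (all names here are illustrative, not the actual commons-math API; I'm backing it with a TreeMap just for the sketch):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a sparse vector exposing an iterator over only
// its non-zero entries, so O(nnz) algorithms stay O(nnz).
class SparseEntries implements Iterable<SparseEntries.Entry> {
    static final class Entry {
        final int index;
        final double value;
        Entry(int index, double value) { this.index = index; this.value = value; }
    }

    private final Map<Integer, Double> entries = new TreeMap<>();

    void set(int index, double value) {
        if (value != 0.0) entries.put(index, value);
    }

    // Iterate over the stored (non-zero) entries only - never the
    // implicit zeros, no matter how large the dimension is.
    public Iterator<Entry> iterator() {
        Iterator<Map.Entry<Integer, Double>> it = entries.entrySet().iterator();
        return new Iterator<Entry>() {
            public boolean hasNext() { return it.hasNext(); }
            public Entry next() {
                Map.Entry<Integer, Double> e = it.next();
                return new Entry(e.getKey(), e.getValue());
            }
        };
    }
}
```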

  * for RealVector - what's with the million different methods mapXXX() and
mapXXXToSelf()?  Why not just map(UnaryFunction) and
mapToSelf(UnaryFunction), where UnaryFunction defines the single method
double apply(double d); ?  Any user who wishes to implement RealVector (to,
say, make a more efficient specialized SparseVector) has to go through the
pain of writing up a million methods dealing with these (and even if
copy/paste gets most of it done, it still leads to some horribly huge .java
files filled with junk that doesn't appear to be used).  There doesn't
even appear to be an AbstractRealVector which implements all of these for
you (by using the above-mentioned iterator() ).
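Roughly what I mean (a minimal sketch; UnaryFunction, AbstractVectorSketch, and DenseSketch are hypothetical names, not commons-math classes): one functional interface plus map()/mapToSelf() written exactly once in an abstract base, instead of a mapXXX() pair per math function:

```java
// The single interface every mapXXX() variant collapses into.
interface UnaryFunction {
    double apply(double d);
}

// Hypothetical abstract base: implements map()/mapToSelf() once for all
// subclasses, in terms of a handful of primitive operations.
abstract class AbstractVectorSketch {
    abstract int getDimension();
    abstract double getEntry(int i);
    abstract void setEntry(int i, double v);
    abstract AbstractVectorSketch copy();

    // Apply f to every entry in place.
    AbstractVectorSketch mapToSelf(UnaryFunction f) {
        for (int i = 0; i < getDimension(); i++) {
            setEntry(i, f.apply(getEntry(i)));
        }
        return this;
    }

    // map() is just copy-then-mapToSelf, written exactly once.
    AbstractVectorSketch map(UnaryFunction f) {
        return copy().mapToSelf(f);
    }
}

// A trivial dense implementation: only the primitives need writing.
class DenseSketch extends AbstractVectorSketch {
    private final double[] data;
    DenseSketch(double[] data) { this.data = data.clone(); }
    int getDimension() { return data.length; }
    double getEntry(int i) { return data[i]; }
    void setEntry(int i, double v) { data[i] = v; }
    AbstractVectorSketch copy() { return new DenseSketch(data); }
}
```

Then mapExp() is just map(x -> Math.exp(x)), mapAbs() is map(Math::abs), and so on - no per-function methods needed on the interface at all.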

  * while we're at it, if there is map(), why not also double
RealVector.collect(Collector), where Collector defines void collect(int
index, double value); and double result(); ?  This could be used for generic
inner products and kernels, and would allow consolidating all of the
L1Norm(), norm(), and LInfNorm() methods into this same method, passing in
different L1NormCollector() etc. instances.
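A sketch of that idiom (again, all names hypothetical): one traversal, with the reduction swapped in as a Collector:

```java
// Hypothetical sketch of the collect() idea: a single traversal serves
// every reduction (norms, inner products, kernels) by swapping Collectors.
final class CollectSketch {
    interface Collector {
        void collect(int index, double value);
        double result();
    }

    // Sum of absolute values.
    static final class L1NormCollector implements Collector {
        private double sum = 0.0;
        public void collect(int index, double value) { sum += Math.abs(value); }
        public double result() { return sum; }
    }

    // Maximum absolute value.
    static final class LInfNormCollector implements Collector {
        private double max = 0.0;
        public void collect(int index, double value) { max = Math.max(max, Math.abs(value)); }
        public double result() { return max; }
    }

    // The one traversal method; a sparse implementation would instead walk
    // only its non-zero entries.
    static double collect(double[] data, Collector c) {
        for (int i = 0; i < data.length; i++) {
            c.collect(i, data[i]);
        }
        return c.result();
    }
}
```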

  * why all the methods which are overloaded to take either RealVector or
double[] (getDistance, dotProduct, add, etc.)?  Is there really that much
overhead in just implementing dotProduct(double[] d) as dotProduct(new
ArrayRealVector(d, false)); ?  No copy is done - nothing is done but one
object creation...
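In other words, the double[] overloads could each be a one-liner. A sketch with a stand-in VectorSketch class playing the role of ArrayRealVector (hypothetical names; the copy=false constructor here mimics ArrayRealVector(double[], boolean)):

```java
// Hypothetical sketch: the double[] overload just wraps the array without
// copying and delegates to the vector-typed method.
class VectorSketch {
    private final double[] data;

    VectorSketch(double[] d, boolean copy) {
        this.data = copy ? d.clone() : d;  // copy=false: share the caller's array
    }

    double dotProduct(VectorSketch other) {
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i] * other.data[i];
        }
        return sum;
    }

    // The overload costs one object allocation, nothing more.
    double dotProduct(double[] d) {
        return dotProduct(new VectorSketch(d, false));
    }
}
```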

  * SparseVector is just a marker interface?  Does it serve any purpose?

I guess I could ask similar questions about the Matrix interfaces, but those
will probably be cleared up by understanding the philosophy behind the
Vector interfaces.

I'd love to use commons-math for the parts of my projects in which the entire
data set can live in memory (part of the computation often falls into this
category; even if it's not the meatiest part, it's big enough that I'll
kill my performance if I'm stuck writing my own subroutines for eigen
computation, etc., for many moderately small matrices), but converting to and
from the commons-math linear interfaces seems a bit unwieldy.  Maybe it would
be easier if I could understand why these are the way they are.

I'm happy to contribute patches consolidating interfaces and/or extending
functionality (you seem to be missing a compact int/double pair
implementation of sparse vectors, for example, which is a fantastically
performant format if vectors are immutable and only being used for dot
products and for adding into dense vectors), if it would be of help (I'm
tracking my attempts at this over on my GitHub clone of trunk:
http://github.com/jakemannix/commons-math ).
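For reference, the compact format I mean is just two parallel arrays - sorted int indices alongside double values - which is cache-friendly for exactly those two operations. A sketch (CompactSparseVector is a hypothetical name, not an existing class):

```java
// Hypothetical sketch of a compact immutable sparse vector: parallel
// sorted-index / value arrays, tuned for dot products with dense vectors
// and for accumulating into dense vectors.
final class CompactSparseVector {
    private final int[] indices;    // sorted positions of non-zero entries
    private final double[] values;  // values[i] lives at position indices[i]

    CompactSparseVector(int[] indices, double[] values) {
        this.indices = indices.clone();  // defensive copies keep it immutable
        this.values = values.clone();
    }

    // Dot product against a dense vector: O(nnz), sequential array reads.
    double dot(double[] dense) {
        double sum = 0.0;
        for (int i = 0; i < indices.length; i++) {
            sum += values[i] * dense[indices[i]];
        }
        return sum;
    }

    // Accumulate this vector into a dense array in place, also O(nnz).
    void addTo(double[] dense) {
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] += values[i];
        }
    }
}
```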

  -jake mannix
  Principal Software Engineer
  Search and Recommender Systems
  LinkedIn.com
