I would like to add my voice as a Mahout committer. We would LOVE to use commons math in Mahout, but these and a few other issues prevent it.
There was word some time ago about integrating a high performance linear package such as MTJ into math. Is that stalled?

On Tue, Oct 13, 2009 at 10:50 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> Greetings, commons-math!
>
> I've been looking at a variety of apache/bsd-licensed linear libraries for
> use in massively parallel machine-learning applications I've been working
> on (I am housing my own open-source library at
> http://decomposer.googlecode.com, and am looking at integrating
> with/using/contributing to Apache Mahout), and I'm wondering a little about
> the linear API there is here in commons-math:
>
> * also for RealVector - No iterator methods? So if the implementation is
> sparse, there's no way to just iterate over the non-zero entries? What's
> worse, you can't even subclass OpenMapVector and expose the iterator on the
> OpenIntToDoubleHashMap inner object, because it's private. :\
>
> * for RealVector - what's with the million-different methods mapXXX(),
> mapXXXtoSelf()? Why not just map(UnaryFunction()), and
> mapToSelf(UnaryFunction()), where UnaryFunction defines the single method
> double apply(double d); ? Any user who wishes to implement RealVector (to
> say, make a more efficient specialized SparseVector) has to go through the
> pain of writing up a million methods dealing with these (and even if
> copy/paste gets most of this, it still leads to some horribly huge .java
> files filled with junk that does not appear to be used). There does not
> even appear to be an AbstractRealVector which implements all of these for
> you (by using the above-mentioned iterator() ).
>
> * while we're at it, if there is map(), why not also double
> RealVector.collect(Collector()), where Collector defines void collect(int
> index, double value); and double result(); - this can be used for generic
> inner products and kernels (and can allow for consolidating all of the
> L1Norm(), norm(), and LInfNorm() methods into this same method, passing in
> different L1NormCollector() etc... instances).
>
> * why all the methods which are overloaded to take either RealVector or
> double[] (getDistance, dotProduct, add, etc...) - is there really that much
> overhead in just implementing dotProduct(double[] d) as just
> dotProduct(new ArrayRealVector(d, false)); - no copy is done, nothing is
> done but one object creation...
>
> * SparseVector is just a marker interface? Does it serve any purpose?
>
> I guess I could ask similar questions on the Matrix interfaces, but
> those will probably be cleared up by understanding the philosophy behind
> the Vector interfaces.
>
> I'd love to use commons-math for parts of my projects in which the entire
> data sets can live in memory (often part of the computation falls into this
> category; even if it's not the most meaty part, it's big enough that I'll
> kill my performance if I am stuck writing my own subroutines for eigen
> computation, etc. for many moderately small matrices), but converting to
> and from the commons-math linear interfaces seems a bit unwieldy. Maybe it
> would be easier if I could understand why these are the way they are.
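
For illustration, the map()/collect() idea from the quoted message might look roughly like the Java sketch below. Everything in it is hypothetical: UnaryFunction and Collector follow the method signatures named in the message, while SketchVector and the three collector classes are stand-ins invented here, not commons-math API.

```java
/** Hypothetical single-method function, as proposed in the quoted message. */
interface UnaryFunction {
    double apply(double d);
}

/** Hypothetical collector: visits (index, value) pairs, then reports a result. */
interface Collector {
    void collect(int index, double value);
    double result();
}

/** Toy dense vector used only for this sketch; not the commons-math RealVector. */
final class SketchVector {
    private final double[] data;

    SketchVector(double... values) {
        this.data = values.clone();
    }

    /** Returns a new vector with f applied to every entry. */
    SketchVector map(UnaryFunction f) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = f.apply(data[i]);
        }
        return new SketchVector(out);
    }

    /** Applies f to every entry in place and returns this vector. */
    SketchVector mapToSelf(UnaryFunction f) {
        for (int i = 0; i < data.length; i++) {
            data[i] = f.apply(data[i]);
        }
        return this;
    }

    /** Feeds every (index, value) pair to the collector and returns its result. */
    double collect(Collector c) {
        for (int i = 0; i < data.length; i++) {
            c.collect(i, data[i]);
        }
        return c.result();
    }

    double[] toArray() {
        return data.clone();
    }
}

/** Sum of absolute values. */
final class L1NormCollector implements Collector {
    private double sum;
    public void collect(int index, double value) { sum += Math.abs(value); }
    public double result() { return sum; }
}

/** Largest absolute value. */
final class LInfNormCollector implements Collector {
    private double max;
    public void collect(int index, double value) { max = Math.max(max, Math.abs(value)); }
    public double result() { return max; }
}

/** Inner product against a dense array of the same length. */
final class DotProductCollector implements Collector {
    private final double[] other;
    private double sum;
    DotProductCollector(double[] other) { this.other = other; }
    public void collect(int index, double value) { sum += value * other[index]; }
    public double result() { return sum; }
}
```

With this shape, a new vector implementation would only need to supply the three traversal methods; each norm or kernel becomes a small Collector rather than another method on the vector interface.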
>
> I'm happy to contribute patches consolidating interfaces and/or extending
> functionality (you seem to be missing a compact int/double pair
> implementation of sparse vectors, for example, which are a fantastically
> performant format if they're immutable and only being used for dot products
> and adding them to dense vectors), if it would be of help (I'm tracking my
> attempts at this over on my GitHub clone of trunk:
> http://github.com/jakemannix/commons-math ).
>
> -jake mannix
> Principal Software Engineer
> Search and Recommender Systems
> LinkedIn.com
>

--
Ted Dunning, CTO
DeepDyve
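
As a rough picture of the "compact int/double pair" sparse format mentioned in the quoted message, here is a minimal Java sketch. It assumes parallel arrays of sorted indices and values; the class name CompactSparseVector and its methods are hypothetical and do not come from commons-math or Mahout.

```java
/** Hypothetical immutable sparse vector stored as parallel index/value arrays. */
final class CompactSparseVector {
    private final int dimension;
    private final int[] indices;   // strictly increasing positions of stored entries
    private final double[] values; // values[k] is the entry at position indices[k]

    CompactSparseVector(int dimension, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values differ in length");
        }
        for (int k = 0; k < indices.length; k++) {
            boolean inRange = indices[k] >= 0 && indices[k] < dimension;
            boolean increasing = k == 0 || indices[k] > indices[k - 1];
            if (!inRange || !increasing) {
                throw new IllegalArgumentException("indices must be in range and strictly increasing");
            }
        }
        this.dimension = dimension;
        this.indices = indices.clone();
        this.values = values.clone();
    }

    /** Dot product with a dense array: touches only the stored entries. */
    double dot(double[] dense) {
        checkLength(dense);
        double sum = 0.0;
        for (int k = 0; k < indices.length; k++) {
            sum += values[k] * dense[indices[k]];
        }
        return sum;
    }

    /** Dot product with another sparse vector via a merge over the sorted indices. */
    double dot(CompactSparseVector other) {
        if (other.dimension != dimension) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        int i = 0;
        int j = 0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] == other.indices[j]) {
                sum += values[i++] * other.values[j++];
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    /** Adds this vector into a dense array in place. */
    void addTo(double[] dense) {
        checkLength(dense);
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] += values[k];
        }
    }

    private void checkLength(double[] dense) {
        if (dense.length != dimension) {
            throw new IllegalArgumentException("dimension mismatch");
        }
    }
}
```

Because the arrays are never modified after construction, both operations are simple linear scans over the stored entries, which is where the performance of this layout would come from.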