Greetings, commons-math! I've been looking at a variety of Apache/BSD-licensed linear algebra libraries for use in massively parallel machine-learning applications I've been working on (I am housing my own open-source library at http://decomposer.googlecode.com, and am looking at integrating with/using/contributing to Apache Mahout), and I'm wondering a bit about the linear algebra API here in commons-math:
* For RealVector - no iterator methods? So if the implementation is sparse, there's no way to just iterate over the non-zero entries? What's worse, you can't even subclass OpenMapVector and expose the iterator on the OpenIntToDoubleHashMap inner object, because it's private. :\ (Rough sketches of what I mean are appended below my sig.)

* Also for RealVector - what's with the million different methods mapXXX()/mapXXXToSelf()? Why not just map(UnaryFunction) and mapToSelf(UnaryFunction), where UnaryFunction defines the single method double apply(double d)? Any user who wishes to implement RealVector (to, say, make a more efficient specialized SparseVector) has to go through the pain of writing up a million methods dealing with these (and even if copy/paste gets most of this, it still leads to some horribly huge .java files filled with junk that does not appear to be used). There does not even appear to be an AbstractRealVector which implements all of these for you (by using the above-mentioned iterator()). (Sketch below.)

* While we're at it, if there is map(), why not also double RealVector.collect(Collector), where Collector defines void collect(int index, double value); and double result();? This can be used for generic inner products and kernels, and it would allow consolidating all of the L1Norm(), norm(), and LInfNorm() methods into this one method, passing in different L1NormCollector() etc. instances. (Sketch below.)

* Why all the methods which are overloaded to take either RealVector or double[] (getDistance, dotProduct, add, etc.)? Is there really that much overhead in just implementing dotProduct(double[] d) as dotProduct(new ArrayRealVector(d, false))? No copy is done, nothing happens but one object creation.

* SparseVector is just a marker interface? Does it serve any purpose?

I guess I could ask similar questions about the Matrix interfaces, but maybe those will be cleared up by understanding the philosophy behind the Vector interfaces.

I'd love to use commons-math for the parts of my projects in which the entire data set can live in memory (often part of the computation falls into this category - even if it's not the meatiest part, it's big enough that I'll kill my performance if I'm stuck writing my own subroutines for eigen computation, etc. for many moderately small matrices), but converting to and from the commons-math linear interfaces seems a bit unwieldy. Maybe it would be easier if I could understand why these are the way they are. I'm happy to contribute patches consolidating interfaces and/or extending functionality (you seem to be missing a compact int/double pair implementation of sparse vectors, for example, which is a fantastically performant format if it's immutable and only used for dot products and for adding onto dense vectors - sketch below), if that would be of help. I'm tracking my attempts at this over on my GitHub clone of trunk: http://github.com/jakemannix/commons-math

-jake mannix
Principal Software Engineer
Search and Recommender Systems
LinkedIn.com
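P.S. To make a few of the points above concrete, here are some rough sketches. None of these names or signatures exist in commons-math today; they're only meant to illustrate the shape of what I have in mind. For sparse-aware iteration, something like:

    import java.util.Iterator;

    // Sketch only: an entry view plus an iterator over non-zero entries, so a sparse
    // implementation can expose its structure without leaking its internal hash map.
    interface VectorEntry {
        int getIndex();
        double getValue();
    }

    interface SparseIterable {
        Iterator<VectorEntry> sparseIterator(); // non-zero entries only, in index order
    }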
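For map()/mapToSelf(), the whole family of mapXXX methods collapses into two, given a one-method function interface. An AbstractRealVector could then provide all of them once for everybody (again, just a sketch under invented names):

    // Sketch: a single-method function interface plus map()/mapToSelf()
    // implementations that every RealVector subclass could simply inherit.
    interface UnaryFunction {
        double apply(double d);
    }

    abstract class AbstractRealVectorSketch {
        abstract int getDimension();
        abstract double getEntry(int index);
        abstract void setEntry(int index, double value);
        abstract AbstractRealVectorSketch copy();

        AbstractRealVectorSketch map(UnaryFunction f) {
            return copy().mapToSelf(f);            // non-destructive: work on a copy
        }

        AbstractRealVectorSketch mapToSelf(UnaryFunction f) {
            for (int i = 0; i < getDimension(); i++) {
                setEntry(i, f.apply(getEntry(i)));  // destructive: mutate in place
            }
            return this;
        }
    }

Then mapExp() is just v.map(new UnaryFunction() { public double apply(double d) { return Math.exp(d); } }), and anyone implementing RealVector gets the whole family for free.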
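For collect(), the norms and inner products all reduce to tiny Collector implementations:

    // Sketch: a fold over (index, value) pairs. A sparse vector would only feed it
    // the non-zero entries.
    interface Collector {
        void collect(int index, double value);
        double result();
    }

    // L1 norm as a Collector, instead of a dedicated method on the vector:
    class L1NormCollector implements Collector {
        private double sum;
        public void collect(int index, double value) { sum += Math.abs(value); }
        public double result() { return sum; }
    }

A double collect(Collector c) method on the vector is then just the iterate-and-fold loop: L1Norm() becomes collect(new L1NormCollector()), and a dot product or a kernel is just another Collector that holds a reference to the other vector.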
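And the compact int/double pair sparse format I mentioned is basically just this: immutable parallel arrays, which makes dot products against dense vectors and scaled adds onto dense vectors about as fast as they can be. This is the kind of thing I'd be happy to contribute:

    // Sketch of an immutable sparse vector backed by parallel arrays: indices[] sorted
    // ascending, values[] holding the corresponding non-zero entries.
    final class CompactSparseVector {
        private final int[] indices;
        private final double[] values;

        CompactSparseVector(int[] indices, double[] values) {
            this.indices = indices;   // caller promises not to mutate after construction
            this.values = values;
        }

        /** Dot product against a dense vector: O(nnz), one multiply-add per non-zero. */
        double dot(double[] dense) {
            double sum = 0.0;
            for (int i = 0; i < indices.length; i++) {
                sum += values[i] * dense[indices[i]];
            }
            return sum;
        }

        /** Accumulate scale * this onto a dense vector in place, again O(nnz). */
        void addTo(double[] dense, double scale) {
            for (int i = 0; i < indices.length; i++) {
                dense[indices[i]] += scale * values[i];
            }
        }
    }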